Description
Containers allow developers to build and deploy applications efficiently, but managing containerized apps with Kubernetes clusters can be complex. Watch this webinar to learn how Azure solutions combined with CNCF projects can automatically identify optimizations and efficiencies, as well as potential issues, and provide actionable recommendations for your Kubernetes clusters before they turn into problems.
Find out how to:
- Diagnose and solve Kubernetes cluster issues with Azure Kubernetes Service (AKS) Diagnostics.
- Maintain your clusters using best practices through Azure Advisor.
- Detect and resolve security vulnerabilities through Azure Security Center.
- Optimize your cluster.
Karen: Alright, let's go ahead and get started. I'd like to thank everyone who's joining us today. Welcome to today's CNCF webinar, "Optimize your Kubernetes cluster on Azure with built-in best practices." I'm Karen Chea, a program manager at Microsoft and a CNCF ambassador. I'll be moderating today's webinar, and I'd like to welcome our presenter today, Jorge Palma, senior program manager at Microsoft.

Just before we start, a few housekeeping items. During the webinar you are not able to talk as an attendee. There is a Q&A box at the bottom of your screen; please feel free to drop questions in there and we will get through as many as we can at the end. This is an official webinar of the CNCF and, as such, is subject to the CNCF code of conduct. Please do not add anything to the chat or questions that would be a violation of the code of conduct, and basically just be respectful of all your fellow participants and presenters.
Jorge: All right, can you hear me well, Karen? Awesome, thank you very much. So good morning, everyone. My name is Jorge Palma and I'm a program manager in the Azure compute team; more specifically, I'm one of the owners of Azure Kubernetes Service. As was mentioned, today we're going to talk a bit about how we can optimize our Kubernetes clusters and the production workloads running in those clusters. So, without further ado, let's jump right into it.
Do you have the right hardening considerations in place? Are you secure? Are you hardened? Are you protected? And are you scalable for the needs of your users and your enterprise as well? This is why we've seen the rise of managed Kubernetes offers across the board, where effectively the promise is very simple and appealing: you get the goodness of Kubernetes, you essentially get a Kubernetes API endpoint, and the rest is abstracted away and handled by a service provider for you. You can then focus on actually delivering the impact and the value, which is actually running your applications and your services inside Kubernetes, and using it without having to care about the nitty-gritty details.

But is that all there is? We've come to see more and more people realizing that, no, it isn't. In fact, if we take a step back, look at the bigger picture, and focus on what we actually want to do: the end goal is not really just to run a great Kubernetes cluster, but actually running applications inside such a cluster, confidently running in production, being able to scale massively and respond to the business needs. Those are the actual objectives that folks are normally tasked with, and those have a few more components that you need to look at that are not just core to Kubernetes. So you need to ask: is your full environment secure?
…You have to tweak some of these configurations, and in many cases that exposes you to other potential mistakes, or to the complexities of the Kubernetes configuration model, which can again throw you off, and into a kind of trough of disillusionment about what Kubernetes can actually give you. Another thing that's been seen a lot, that the community has been discussing quite a lot, is the fact that a lot of clusters have very permissive role-based access controls, as well as effectively just being plainly insecure, without any sort of pod security or cluster security thought behind them whatsoever. And tied to that is a lack of governance policies. Again, very rarely have we seen users that have only one Kubernetes cluster for their entire set of use cases.
In this case, is your cluster maxed out and optimized? So what actually can we do, and what ideas do we have? Obviously, if one of the problems is that Kubernetes configurations are incorrect, what we need is to validate them, right? And similarly, if a common set of issues is around container, cluster, and overall security, the solution is to manage them.

Now, you might say: okay, Jorge, you're just literally answering the questions with themselves. We'll take a look into what this actually means, because there are actually a lot of solutions out there already that deliver on a lot of these same needs, but especially when you look at them at scale, there's not much of an approach; there's not a lot. If you look, for example, at the CNCF webinars, there is great content for pretty much every single one of these points already. Today we're going to look at them holistically, and we're going to approach them again across the full spectrum.

If we move to the next one: have a curated set of secure configurations, and be able to enforce them. That's the first hint of what we're going to be discussing today, which is policies: essentially having policy enforcement and policy management. And then number four, again, as I hinted before: leveraging authentication directories.
Make sure that you are able to apply fine-grained access control at scale and minimize human error, which is normally the biggest cause of issues in a properly running Kubernetes cluster. And make sure that the cluster is hardened against disruption, with things like pod disruption budgets, to avoid voluntary disruptions or simple mistakes affecting your cluster. And then the last one is something that has been approached a few times already, and I'd like to take a slightly different angle on it.
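The pod disruption budgets mentioned here are plain Kubernetes objects. A minimal sketch might look like this (the deployment label and threshold are illustrative, not from the webinar; use `policy/v1beta1` on clusters older than 1.21):

```yaml
# Hypothetical example: keep at least 2 front-end replicas running
# during voluntary disruptions such as node drains and upgrades.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: azure-vote-front-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: azure-vote-front
```

With this in place, `kubectl drain` and similar voluntary operations will refuse to evict pods past the budget.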
How can we be more efficient, for example from a cost-optimization perspective? Obviously, the first thing that comes to mind immediately is: okay, we use autoscaling, obviously. But sometimes it's not just about the scaling per se. "Have the horizontal pod autoscaler and cluster autoscaler configured" is normally the template answer that gets given, but in many cases that doesn't give you the last mile of optimization.

You absolutely should, if you can, have those two solutions running, and those will deliver, I believe, the greatest amount of benefit. But then you can actually go beyond that, for example by optimizing based on your workload: tuning the autoscaling capabilities to your specific workload, so that the cluster scales closer to what your workload's needs are, and similarly making sure that your infrastructure is also optimized for your cluster and for the different workloads on your cluster.

So this is not suitable for every type of workload, but it is great for workloads that support this kind of disruption, that are fine with capacity going away and then being scaled back up, and that are very resilient to that. You can really maximize your efficiency and minimize your costs if you're able to leverage these together. All right, so let's delve into the first one, since I didn't really give great answers yet, just the bullet points. The first one is really how to validate correct configurations.
Very basic things: make sure you don't use "latest" in image tags (and again, all of these could very much be their own webinar topic); make sure you have a secure registry; make sure you have pod security configured. We're going to briefly discuss that one, but there's a lot that could be said about this topic.
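For instance, instead of relying on the `latest` tag, pin images to a specific tag, or better, an immutable digest (illustrative fragment; the registry and names are placeholders):

```yaml
# Avoid: image: nginx:latest   (unpredictable, hard to audit or roll back)
# Prefer a pinned tag, or an immutable digest:
containers:
- name: web
  image: myregistry.azurecr.io/web:1.4.2
  # or: image: myregistry.azurecr.io/web@sha256:<digest>
```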
Make sure that basic Kubernetes hygiene is handled, like keeping your Kubernetes version up to date, and making sure that your nodes themselves are patched and up to date as well, not just the Kubernetes version; you need to keep both in tandem. And then finally, make sure that all administrative access is minimized, and that's something that you can validate and audit. All right, so on to actual concrete solutions: what can we do?
…Tools like the Secure DevOps Kit are great tools that, in some cases, actually give you recommendations of best practices and validations that you need to do, and in many other cases are actually able to make sure that your cluster is running according to those same recommendations, and so validate the configuration of that specific cluster. So in this case we're going to move into our first demo and take a quick look at the Secure DevOps Kit.

Let's see if this works well; hopefully you'll be able to see my screen. What we have here right now: we are in a cluster called rbac-three, and this cluster is fairly simple, so it doesn't really have many things running. The pods we have running here have somewhat interesting names: we have these two privileged-container and sensitive-mount pods, and we have a few azure-vote back-end and azure-vote front-end pods. So what do these pods actually do?
Let's take a look at what these actually do. Those specific pods are, as the names indicate, probably not the greatest thing to have running on clusters, and that's on purpose, for demo purposes. You'll see that they're just basic nginx pods where I added a few tweaks: for example, I'm running a few of them as privileged, and in another case I'm actually, very specifically, mounting some sensitive paths into the pod. And then I also added a few more nasty things, like some very high-privilege cluster roles, for no reason whatsoever, and then I also bound those to a service account, again for no reason whatsoever except to demonstrate this. In terms of the Azure Vote application, this is your everyday simple application: just a Redis back end and a simple web front end that is exposed by a LoadBalancer service. Really nothing special. And what we're going to see here is what happens when we actually take a look at all of this.
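The manifests for pods like these aren't shown on screen for long, but the two risky patterns described can be sketched roughly as follows (a hypothetical reconstruction; names and paths are illustrative):

```yaml
# A privileged nginx pod, plus a pod mounting a sensitive host path.
# Both are exactly the patterns a security scanner should flag.
apiVersion: v1
kind: Pod
metadata:
  name: privileged-container
spec:
  containers:
  - name: nginx
    image: nginx:1.19
    securityContext:
      privileged: true     # near-root access to the host node
---
apiVersion: v1
kind: Pod
metadata:
  name: sensitive-mount
spec:
  containers:
  - name: nginx
    image: nginx:1.19
    volumeMounts:
    - name: host-etc
      mountPath: /host-etc
  volumes:
  - name: host-etc
    hostPath:
      path: /etc           # sensitive host path exposed to the pod
```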
So what I did before this is actually install the Continuous Assurance scanner, and again you'll see details of my subscription and of the actual rbac-three cluster that we were looking at. What this effectively did, and you'd be able to see it here, is install the security scanner into the cluster; in this case it didn't, because I already had it installed. So let's take a look at what exactly the result of those scanning efforts was.

When you take a look at the logs of that pod, probably as you'd expect after seeing the deployments I had in this cluster, you'll see that I pretty much failed almost all categories. It's actually checking a number of things; for example, on the "restrict privileged containers" check you'll see, obviously, that one failed: I have one very specific pod that violates it.
Then you'll see the detailed information for each one. This is essentially what I wanted to show you in this first demo. The reason I wanted to show it is that in Azure you're actually going to be able to see this same information in a bit more graphical way, with tools like Azure Advisor, for example. As you'll see here on the right, you'll get a graphical UI onto pretty much the same set of information, and you'll be able to navigate through it and even be guided to documentation.

Directly from there you can click on your failed recommendations, and you're able to see a number of the ones that you already saw in the previous demo. All of this is on the GitHub page: both the advisories as well as the toolkit are on GitHub, and the links are provided above, for you to check all the policies it's running, all the recommendations, and all the heuristics that it runs inside each cluster.
An interesting thing on top of this is the specific security side of it, which is making sure that you have all the security recommendations, which are normally of higher priority, in check, and that you have them all addressed in your cluster. On Azure, another piece of integration that also draws from this Secure DevOps Kit is Azure Security Center, and Azure Security Center will basically have all that information; you'll see it on the screen.

On the right-hand side you'll see similar things to what you just saw, like: hey, I detected a new high-privilege role, or hey, I detected a new container, new pods running in the kube-system namespace, or I detected high-privilege containers running. So, all the same information, again in a graphical UI, and that then allows you to do this at scale as well: to see this across all your clusters, all your fleet, and even get alerts based on it.
…that might be exploiting it to run cryptomining containers. This is actually a real threat that we've seen, and we've actually seen it realized several times, so having that kind of capability, having those checks, is really important as well. Let's take a look, then, at how this actually looks; I'm going to use the AKS portal flows as an example.

So if we take a look at this rbac-three cluster and jump immediately into the Advisor recommendations on the left-hand side, you're going to see a very similar feature set to what you saw just before. In this case, for example: hey, you are not running a supported Kubernetes version, so make sure that you keep your version within the GA-supported versions. And you'll see a few high-severity security incidents that you also see in Security Center.
These are pretty much exactly the same, one by one, as the ones that we saw before. As an example, we have a privileged container detected. We can actually drill back into it a bit more and see things like: hey, why is this alert triggering, and how many resources is it affecting? Again, you'll see that name you just saw before, that ASC privileged-container name, and you'll even see how to take action against it.

So again, it's how Azure is taking one of these open-source tools and its validations and showing how to apply them at scale to your fleet with ease. I've got a few tips if you're trying to get started with something like this, with a tool of this manner: don't go all in. I've personally seen a lot of folks fail when they just decide to alert on everything and try to attack everything at once. Start simple; start with a few very high-impact recommendations.
Again, maybe the security ones, maybe the Kubernetes-version one, maybe some very critical ones that you see as really worthy of waking you up at night, per se. And then just make sure that you notify the right folks, and don't just spam everybody. Then build on that, iterate on that; maybe even build some automation, some functions and runbooks based on it, that you can trigger automatically once those events occur. That's something you can do with most of these tools as well. All right.
So let's switch over, then, to policy management. Now we've seen a lot of the recommendations and alerts for things that you should be paying attention to and that you should be validating in your Kubernetes configurations. How can you actually enforce those as well, and how can you enforce and manage them at scale? The community has been investing a lot of time and effort into this area, and there are great solutions out there, be it Gatekeeper or k-rail, and probably others; apologies if I miss any.

So the next level of that challenge is, again: if you look at each one of those solutions, they do a great job. You just install them in a cluster, and then you're able to make sure that each user abides by the policies, because they're being enforced, or you can turn on audit mode and make sure that you are auditing every action. The next level of that is doing that, obviously, at scale, and you're going to see this trend.
During this presentation I'm going to be calling this out a few times. The ideal scenario that I would normally envision is that you have an architect (normally that security role that exists in many enterprises) acting as the policy-manager role: they would define the policies, and they would apply them either to a fleet of clusters or to specific subsets. Maybe many times you might want your clusters to all be the same, but the truth is that in many enterprise environments, in many cases, that's not exactly true, because applications have different requirements, business units have different requirements, subgroups have different requirements. And so, while maybe a specific cluster that runs a specific application or set of applications is exactly the same as its staging environment, QA environment, etc. ...
…it might not be the same as a set of clusters used by a different team. But nonetheless, you do need to have a set of policies that are maybe company-wide, and enforced company-wide, that you'd like to apply to all of them, with the exception maybe of the sandbox clusters or the developer clusters. So this is the role of this architect or policy manager (don't pay too much attention to the name) who would define these policies, while developers themselves would just be using these clusters safely, knowing that the clusters are policy-protected. And then it would be great to actually have a mechanism to report back to the architect, to say: hey, at least all these clusters are compliant, while that one is actually not; you might want to do something about that, or you might just mute that warning, because it's expected. So again, without further ado...
Let's take a look at what this looks like; we're going to use OPA Gatekeeper for this specific demo. Right now we're going to swap into a different cluster: we're going to go from rbac-three into this policy-one cluster, which has nothing running, really, and I'm going to attempt to run two specific things. I'm going to attempt to run this privileged nginx, which is hopefully now familiar to you: it is just a regular nginx with the privileged security context. And an unprivileged, just basic, nginx. I'm going to try and run the privileged one first, and you're going to see immediately, hopefully within seconds, that it's actually going to get an error: it's denied by policy. Why? Because it's running a privileged container. If you actually take a look at the constraint and the template, you'll see exactly which policy we're applying here; in this case, privileged containers are not allowed, and you'll see things like: what is the enforcement action in this case?
It's deny. What are the excluded namespaces that you might add to it? So again, the full fledge of OPA Gatekeeper capabilities. Similarly, though, the unprivileged one is also not allowed, and that is because there's yet another policy: in this case, the container image is not allowed, because it comes from a source that is not allowed. If you take a look at this actual constraint, you'll see that I have an image regex for one very specific registry, so I'm actually able to control which registries this cluster can pull images from.
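Constraints like the ones shown behave like the `K8sPSPPrivilegedContainer` and `K8sAllowedRepos` templates from the community gatekeeper-library. A registry-restriction constraint can be sketched roughly like this (the registry value and namespaces are illustrative, and the matching ConstraintTemplate must already be installed):

```yaml
# Only allow images pulled from one specific registry.
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedRepos
metadata:
  name: allowed-registry-only
spec:
  enforcementAction: deny       # use "dryrun" to audit before enforcing
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Pod"]
    excludedNamespaces: ["kube-system"]
  parameters:
    repos:
    - "myregistry.azurecr.io/"
```

Pods referencing images outside the listed prefix are rejected at admission time, which is exactly the "denied by policy" error seen in the demo.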
So this was a very, very good example of how this works. Now, on a second cluster, let's say policy-two, in a different resource group, we could actually apply the privileged one. That means that this cluster does not have policies applied to it, and you'll see that while I have my policy-one right here already, there's no policy enforced on this resource.

So the first thing I'm going to do is basically apply the baseline from Pod Security Policy, but through Gatekeeper. Pod Security Policy, as you know, is in beta and likely to never graduate, but with Gatekeeper you can actually get the exact same controls. So here you'll see essentially all the controls from the baseline pod security policy, all bundled up into an initiative. And now what we're going to do is select the resource group that has those two clusters and, at the same time, apply this...
...pod security baseline, through OPA Gatekeeper, into all those clusters at the same time. And I could be a bit more permissive and say I just want to audit, depending on the different environments, for example, or I can be very prescriptive and say: no, I'm going to deny anything that does not comply with this policy. And you'll see the assignment has already been done, so we can go ahead and test it out. But keep in mind that that was not the only policy we had on policy-one: we also had an image-registry provenance policy. Now let's take a quick look at all the built-in policies. Obviously, with Gatekeeper you can customize and build your own, and here you just have a set of already built-in, pre-created policies, just to help you get started...
...without you having to actually create your own. So this is the one, the last one right there: "ensure only allowed container images." You'll see in this definition that, as you might expect, this is actually going to have a constraint template and a constraint (again, a basic OPA Gatekeeper constraint and template), and they actually point to GitHub, which you can see here as well, so you can see exactly what these are doing, even before they're running in the cluster; you can access them and inspect them very granularly. And now let's go ahead and assign it. Now, you could decide that, no, actually this is a dev cluster, I'm not going to apply this registry policy to it; or you can decide: no, I want all my clusters in this case to be consistent and to have these.
…It could be with effectively anything that you use for your user management: you could integrate it directly via the OIDC parameters, or you could have your own custom authentication webhook server, something like Guard, for example, and you can leverage all these different mechanisms. My point here is: make sure that you do, because they're extremely valuable. I'm not going to make the pitch on why central authentication is key; I think there's been enough of that in the industry over the last years.
The next one, maybe a bit more nuanced, is authorization at scale. Again, Kubernetes actually provides you with very, very rich RBAC mechanisms that are Kubernetes-native and that allow you to secure, for example, what actions a user can do within a specific namespace, or what actions they can do across the cluster, or maybe even what a service account can do across the cluster, etc. But in many cases that is obviously internal to each cluster, and then the question becomes, once more, how to apply it at scale. Once again, a solution might just be to have configuration management that ensures that all clusters have the exact same roles, the exact same service accounts, bindings, etc., and that all clusters have that bootstrapping done as they come up, and get it reconciled.
For example, you might want a user to be able to manipulate only objects with a certain label. That's not something that Kubernetes-native RBAC supports out of the box, but Kubernetes does give you the possibility to hook in, for example, a custom authorization webhook server with which you could then do whatever you needed. We're going to take a look at how a custom authorization server can make your authorization-at-scale scenario much easier.
…There was no bootstrapping done whatsoever, so effectively that user you just saw has no permissions. But if we jump into its IAM blade, and into the roles that I can give them, you can actually see that we have a number of RBAC roles, and especially this reader role, which I find very interesting because it's exactly what I need: I need to see things all across this cluster, but I don't actually need to do anything to it.

So I'm going to apply this reader role, which is essentially the same reader role that you'd normally find among the built-in roles for Kubernetes; it allows me to see anything except secrets, so I don't elevate my privilege. And you'll see things like the admin role, or the writer role, which allows me to pretty much change objects without elevating my privilege or changing anything with regard to authentication and authorization. These are all built-in roles.
If you do "kubectl get pods": aha, now I'm actually able to list pods. I'm actually able to list nodes, and I can even list pods across all namespaces and see them happily running. But if I actually try to deploy my vote application, I can't: again, I'm just a reader in this specific cluster. So apparently I do need a bit more permissions.
So let me get the ID of this cluster really quickly, and let me apply an additional permission, in this case the admin permission, but not on the whole cluster; I don't need that. I'm going to apply it very granularly, to the default namespace, where I wanted to deploy my application. And after that is done, I'm actually able to reapply this YAML, and this time I'm actually able to create it.
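With Azure RBAC for Kubernetes authorization, the namespace-scoped assignment described here can be sketched with the Azure CLI roughly as follows (a sketch against a live subscription; the resource group, cluster, and user names are placeholders, not from the webinar):

```shell
# Look up the cluster's resource ID (placeholder names).
AKS_ID=$(az aks show --resource-group myGroup --name rbac-five \
  --query id -o tsv)

# Grant admin rights scoped to just the "default" namespace,
# rather than to the whole cluster.
az role assignment create \
  --role "Azure Kubernetes Service RBAC Admin" \
  --assignee user@example.com \
  --scope "$AKS_ID/namespaces/default"
```

The same command with a resource-group or subscription `--scope` is what applies the permission across a whole fleet, as shown next in the demo.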
Now, this was one cluster, and everything I just did I could have done with Kubernetes RBAC. So let's see how to apply this at scale, and do it without any bootstrapping. I have this rbac-six cluster, and I'm going to use the exact same mechanism that I did for rbac-five: I'm going to go ahead and add a role assignment, and I'm going to do it for the whole resource group in this case. And once more, much like with policy, I could do the whole resource group, or a whole subscription, so the whole fleet of clusters, or a specific set of clusters, depending on what my needs are as a user, and the admins can actually control that very easily. So I've just done that, and I'm going to switch to this rbac-six cluster. I'm going to spare you the joy of having me go through the loops of multi-factor authentication again, but effectively this is the first time that I'm logging into this cluster.
There is, as you can see by the prompt, no bootstrapping that I did on the cluster whatsoever, except actually adding the role assignment directly from outside the cluster. And at this point I'm already able to list pods, but not to apply the YAMLs; so, exactly the permission that I wanted to have, and one that I can apply at scale very easily.
So obviously the next part of this, if you're looking at production, is diagnosing and troubleshooting: how to make sure that both your cluster is healthy and your application is healthy. Once more, this is in itself a whole topic for a full presentation, so, much like with the rest of them, I'm not going to go really deep; I'm just going to make a few considerations and give a few examples.

So you can see a page where, even from your VS Code editor, with the Kubernetes extension, you have your Kubernetes clusters listed, and you can go right ahead, right-click, and see the diagnostics for that cluster running in Azure. You'll see pretty much the same page that you see on screen, and you'll see, for example: hey, this cluster is healthy. So you can run a full check-up on that cluster: cluster's healthy, workloads are healthy, no problems whatsoever, or otherwise.
Now, for our last point, which is fairly interesting: efficiency and cost savings. To start with that, on our cluster you actually get a recommendation such as: hey, make sure that you enable the autoscaler. Again, as I mentioned, that's the easy part; you can just go ahead and enable the autoscaler. The more nuanced part, where you're still going to get a lot of benefits...
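The "easy part" really is one command on AKS: the cluster autoscaler can be turned on for an existing cluster roughly like this (a sketch against a live subscription; names and node counts are placeholders):

```shell
# Enable the cluster autoscaler on an existing AKS cluster,
# letting the default node pool scale between 1 and 5 nodes.
az aks update \
  --resource-group myGroup \
  --name myCluster \
  --enable-cluster-autoscaler \
  --min-count 1 \
  --max-count 5
```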
...and that a lot of folks are not leveraging, is making sure that you are leveraging the heterogeneous capabilities of Kubernetes and of the infrastructure itself: making sure you're using different node pools and agent pools, and node pools that have the right-sized VMs for the workloads that you can then place there. And similarly, leverage all the tooling Kubernetes and the ecosystem give you, like the cluster autoscaler or the horizontal pod autoscaler, used in combination with something like, for example, virtual kubelet, which is a sandbox project from the CNCF, to create a virtual node, if you will, that you can then burst into; that virtual node will then point into a service. We'll take a look at that in a second. And then use that even more in combination with things like the descheduler, which you can use to rebalance unbalanced or unoptimized clusters, because again the Kubernetes scheduler responds to what the current state of the cluster effectively is.
So if you have two applications, and one of them is very scaled up, and you have a nice set of affinity or node-selection rules between those two applications, then when that second application starts to scale, the scheduler will take decisions based on the current state of the cluster, in which application one was fully scaled out. That might not be the best placement for scaling up application two, and when application one scales back down, we'd normally get stuck with the placement that happened for application two, were it not for the descheduler.
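The rebalancing described can be expressed as a descheduler policy; a sketch using the descheduler's `v1alpha1` policy format (the thresholds are illustrative) might be:

```yaml
# Evict pods from overutilized nodes whenever other nodes sit below
# the thresholds, so the scheduler can re-place them more evenly.
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:        # nodes below all of these are underutilized
          cpu: 20
          memory: 20
          pods: 20
        targetThresholds:  # nodes above any of these can shed pods
          cpu: 50
          memory: 50
          pods: 50
```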
In case I went a bit too fast, I'm going to do a quick recap for you. Horizontal pod autoscaler: most of the folks on the call know what that means, but as a quick recap, it's basically how you can autoscale your deployments based on their metrics. By default it's watching the Kubernetes metrics server, which collects all the metrics from your pods, and based on that it will take decisions to scale the number of replicas of your deployment up or down.
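The recap above corresponds to a standard HorizontalPodAutoscaler object, for example (the deployment name and targets are illustrative; this uses the `autoscaling/v2` API available on current clusters):

```yaml
# Scale the front end between 3 and 10 replicas,
# targeting 50% average CPU utilization across its pods.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: azure-vote-front-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: azure-vote-front
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
```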
Okay, so easy enough: it will just scale your deployment up and down based on the usage your application is seeing from the memory and CPU perspective, but you can also hook up custom metrics to it. Now, the cluster autoscaler is all about the infrastructure side: if you're scaling your deployment rapidly, what happens to your nodes? The cluster autoscaler actually looks at pods in a pending state that have no place to go, because you've run out of nodes, and it will talk to the infrastructure provider (depicted in the picture it's Azure here, but it could be other infrastructure providers that support the cluster autoscaler), and it will actually add more nodes, based on the number of pending pods and on what their requirements specifically are. Now, if we then add something like virtual kubelet (and I recommend you take a look at it; it's one of the sandbox projects of the CNCF)...
B
Now things get very interesting, because aside from having those nodes, you can have this virtual node that you can burst into. That virtual node, as I mentioned, can talk to an API, in this case Azure Container Instances. For those of you who don't know what Container Instances is, it's essentially docker run in Azure: you can just send a container, run it there, do an ACI run, and it runs through.
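Scheduling a pod onto the virtual node typically takes an explicit node selector and toleration. The exact labels and taints depend on how the provider is set up, so treat this as an assumed sketch modeled on common AKS virtual-node examples:

```yaml
# Assumed labels/taints; verify against your own virtual node setup.
apiVersion: v1
kind: Pod
metadata:
  name: burst-worker
spec:
  containers:
  - name: burst-worker
    image: nginx:1.21
  nodeSelector:
    type: virtual-kubelet
  tolerations:
  - key: virtual-kubelet.io/provider
    operator: Exists
  - key: azure.com/aci
    effect: NoSchedule
```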
B
This is great for bursting, obviously. But if we take a quick peek back into optimization, which is what we want, you probably don't want to leave these pods running for a long period of time in these additional services. You want to maximize, once more, your actual infrastructure, which is where you can get the most savings from, so you'd like those pods to come back and hit the nodes. So once again, remember what I mentioned: you can use the descheduler and similar projects to do that kind of rebalancing after the burst is done.
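As an illustration of that rebalancing, a descheduler policy along these lines evicts pods from skewed placements so the default scheduler can place them again. The thresholds are made up, and the descheduler's policy format has changed across releases, so check the version you run:

```yaml
# Hypothetical v1alpha1-style policy: rebalance when some nodes sit
# below 20% utilization while others can absorb the evicted pods.
apiVersion: descheduler/v1alpha1
kind: DeschedulerPolicy
strategies:
  LowNodeUtilization:
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:
          cpu: 20
          memory: 20
          pods: 20
        targetThresholds:
          cpu: 50
          memory: 50
          pods: 50
```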
B
So you can, for example, use a combination of the cluster autoscaler, Virtual Kubelet and virtual nodes, and these actually work together to make sure that you are able to respond to that burst and to very fast demand. Because effectively, what Virtual Kubelet delivers versus the cluster autoscaler is this: the cluster autoscaler needs to provision nodes, and they need to come up. Even if they're VMs, which is pretty fast, it still takes a bit of time. It's different with virtual nodes.
B
It's just docker run, so you know how fast that is: in just a few seconds you get more containers running. So it's very, very appealing. But again, to maximize savings, you'd probably want to bring those pods back onto your actual nodes, especially for long-running purposes. So again, that would be the descheduler and the cluster autoscaler.
B
It's also an open-source project, which I have linked to, and I recommend that you test it out for your workloads, especially if you're doing any sort of event-driven programming or you have any event-driven architecture in your applications. All right, so for the last demo, I just want to quickly show what this looks like.
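The project isn't named in this part of the transcript, but the event-driven autoscaling described matches KEDA. Assuming KEDA, a ScaledObject that scales a consumer on Azure Storage queue depth might look like this (names and thresholds are invented):

```yaml
# Hypothetical KEDA ScaledObject: add a replica for roughly every
# 5 messages waiting in the "orders" queue; scale to zero when idle.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: orders-consumer
spec:
  scaleTargetRef:
    name: orders-consumer   # Deployment to scale
  minReplicaCount: 0
  maxReplicaCount: 20
  triggers:
  - type: azure-queue
    metadata:
      queueName: orders
      queueLength: "5"
      connectionFromEnv: STORAGE_CONNECTION_STRING
```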
B
B
B
A
Okay, well, thank you for the presentation. We'll now take some time to go through questions. As a reminder, if you do have a question, please drop it into the Q&A box at the bottom of the screen and we will get through as many as we have time for. So, Jorge, first question: do we have similar controls and other stringent policies for Service Fabric? Also, in general, for Azure offerings, would you suggest deploying canary apps to Service Fabric first?
B
It all depends, obviously, on your needs and workloads; they are two different services. From a purely Azure perspective, this presentation is obviously focused exclusively on Kubernetes, but Azure has a lot more services that are fit for purpose for specific workloads: App Service, Web Apps, Functions, Service Fabric. So, as a generic rule of thumb, and I
B
think that's valid for every provider: choose the service that is most fit for purpose for your workloads. Again, Kubernetes has a tremendous ecosystem right now where you can effectively run anything, which is one of its great advantages, and run it with the same confidence. But if you're looking for super simplicity, there is, for example, ACI, which was something that we mentioned here.
B
If you're just looking for something to run a container, there's really no point in setting up a full Kubernetes cluster. You can set up something locally with k3s or minikube or something similar, or you could just do an ACI run on Azure. Similarly, the same applies to Service Fabric: Service Fabric is a bit more about running microservices than containers themselves. It happens to run containers, but it's essentially a microservices platform that is enterprise-ready.
A
Great question. We have one asking: is there a default HPA metrics server that Azure provides, specifically for custom metrics?
B
A
B
Not as of right now; we haven't gotten any large number of requests for that, and we're actually very happy when folks run the full spectrum of community solutions in their clusters. We're very committed to making sure that AKS is 100% vanilla Kubernetes. But it is something interesting: if that's something that you see as a very valuable piece for you and for your clusters, just let me know. My GitHub handle, I believe, is on the opening slide, or just hit me up on Twitter and I'm happy to chat about that.
A
Do you have any best practices to prepare for a failover of a cluster in a worst case, without copying or replicating everything, especially Azure resources, for production usage?
B
B
But it can even help in cases of high load: having something like a Traffic Manager or DNS load balancer pointing to two active-active clusters, and having that second one scale aggressively if it needs to come into play and help out the main cluster, or in cases of disruption. Then, obviously, the recommendation again, if we're looking at it from a pure Azure perspective, is to look at the services that you use with your cluster.
B
If you're looking at anything data-related or similar, use geo-replicated services like Cosmos DB, etc., which would make your job easier, because otherwise you need to have a BC/DR mechanism for storage, stateful sets and persistence. You can then handle that with things like, for example, Velero or similar backup solutions, but then you need to manage that specifically. So normally we've seen more success with active-active setups and geo-replicated services on the backend architecture.
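For the backup-based path, a scheduled Velero backup is the usual pattern. This is a sketch with invented names and retention; check the docs for the Velero version you deploy:

```yaml
# Hypothetical Velero Schedule: back up every namespace daily at 02:00,
# keep backups for 30 days.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-cluster-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"
  template:
    includedNamespaces:
    - "*"
    ttl: 720h0m0s
```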
B
Yes, there is documentation, and it is linked on these slides as well. On the difference: it's not a difference; they're not two distinct services. I want to make that clear if it wasn't during the talk. Azure Policy is merely using OPA, merely providing you with an interface to use OPA at scale. So they're not two distinct services.
B
In fact, the link on the pages that I showed actually shows the exact OPA constraints and templates that Azure Policy applies to clusters, and you can see them running on your clusters after it's applied. Azure Policy, obviously, is the broader service that you can apply to all Azure services to control all kinds of different things.
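To make that concrete, the constraints Gatekeeper enforces look like this. The example assumes the commonly used K8sRequiredLabels constraint template is already installed, and the label name is invented:

```yaml
# Hypothetical Gatekeeper constraint: every Namespace must carry
# an "owner" label (requires the K8sRequiredLabels template).
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: ns-must-have-owner
spec:
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Namespace"]
  parameters:
    labels: ["owner"]
```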
B
From a Kubernetes perspective, internal to the cluster, it literally uses OPA Gatekeeper to make sure that the policies are enforced, and it uses Gatekeeper's audit capability to check all of them as well. So they play together; they're not two distinct things. And I know that this was my last question, so I do want to make sure that folks know: if you need to post any additional question, or you want to continue the conversation, just make sure to reach me on Twitter. Really happy to do that.