From YouTube: Organizing Teams for GitOps and Cloud Native Deployments - Sandeep Parikh, Google Cloud
Description
Organizing Teams for GitOps and Cloud Native Deployments - Sandeep Parikh, Google Cloud
Large scale Cloud Native deployments typically include multiple teams running multiple applications across multiple environments - but how should teams be organized to enable efficient software delivery? How should responsibilities be split between platform, DevOps, and application teams? In this talk we’ll walk through the different approaches teams can adopt for organizing Git repos, handling upstream dependencies, and managing software rollouts. This talk will go in-depth about repo structure and strategies for managing the release process, as well as how to enforce policies across configs and manifests.
My name is Sandeep and I've been with Google Cloud for almost seven years. I've had several different roles and titles over that time, but ultimately they've all revolved around helping teams adopt and optimize for cloud in some form or fashion. You can always find me at circus monkey, that's crcs mnky, on Twitter if you've got questions about GitOps, DevOps, or anything else that comes to mind. Now, there's quite a lot I want to cover with y'all today, so we're going to move through it pretty quickly.
But how do we measure that software delivery performance? Well, in our research program we have found a valid and reliable way to do just this. There are two metrics representing speed and two metrics representing stability. For the speed metrics, we have deployment frequency, which is how often you deploy, and we also have lead time for changes, which is basically measuring the time from a commit all the way to that commit being deployed into prod. And then there are the two metrics on the stability front.
Now, these four metrics can be applied to any kind of software delivery, whether it's web or mobile, firmware, what have you. And using these metrics, we can actually bucket teams into specific categories: low, medium, high, and elite software delivery performing teams. But these are just trailing indicators of software delivery performance, and that's where the leading indicators come in. Now, we don't have time to go through all of the analysis from DORA, but we know that there are specific leading indicators that drive software delivery performance and have a positive impact, for the purposes of this slide.
Now, that's just a little bit about DevOps, and it's important because I think one of the big takeaways for us from DORA is that it's not about the tools, it's about the process and the people involved, and that's what really drives the stability and the velocity improvements.
They usually amount to having a lot of individual teams pushing code to a lot of deployment environments, and those deployment environments are spread across multiple regions. It's a simplistic view, but ultimately it encompasses quite a bit of complexity. So let's try to break it down, starting with some foundations and some assumptions.
So why do teams even want GitOps? Well, it's because they want to get out of the imperative operations business, right? Those sorts of approaches are hard to scale, hard to fix, and hard to roll back, especially in case there is a problem. So we don't want to take that approach anymore; instead we adopt a GitOps approach that gives us some very specific properties, namely being declarative: a system managed by GitOps must have its desired state expressed in a declarative fashion.
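For illustration, here is a minimal sketch of what that declarative desired state stored in Git might look like, a plain Kubernetes Deployment manifest that the GitOps tooling continuously reconciles against the cluster (the names and image here are illustrative, not from the talk):

```yaml
# Desired state, checked into Git; the GitOps controller makes the
# cluster match this manifest rather than us running imperative commands.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
  labels:
    app: my-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
        - name: my-service
          image: registry.example.com/my-service:1.4.2
          ports:
            - containerPort: 8080
```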
So if there is an imperative operation against a particular cluster, against a resource that's coming from a Git repo, we know that that imperative operation will actually get overwritten on the next reconciliation loop. Those are the principles we get by adopting a GitOps-style approach. Now, with GitOps out of the way, I want to talk about some of the other assumptions that I want to make up front. For our notion of many teams, we're going to categorize them into some pretty coarse buckets.
Forgive me the coarseness: we have application teams, operations teams, and platform teams. For infrastructure, we'll be assuming Kubernetes as your cloud native deployment, which makes sense, and some GitOps tooling. We don't need to be specific about which GitOps tooling, whether it's Argo CD, Flux, or Config Sync; just know that most of what we're going to talk about involves one of these popular GitOps tools. And then there are the deployment regions to consider.
Well, then we have the app operators. They're responsible for deployment manifests and making sure the app or service is up and running. And then finally we have the platform admins. They cover the infrastructure bits, not necessarily the compute layer of Kubernetes itself (though they may), but the things just one level up from Kubernetes: RBAC, quotas, resource limits, all that sort of work. It's the kind of initial infrastructure that has to get laid down on Kubernetes before application teams can run and scale.
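As a rough sketch of that initial layer a platform team might lay down for one team, here is a namespace, a quota, and an RBAC binding (team names, quota values, and group names are illustrative assumptions):

```yaml
# Per-team baseline the platform team owns: a namespace, resource limits,
# and edit access for that team's developer group.
apiVersion: v1
kind: Namespace
metadata:
  name: team-a
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 64Gi
    pods: "50"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-a-editors
  namespace: team-a
subjects:
  - kind: Group
    name: team-a-developers        # hypothetical IAM/IdP group
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: edit                       # built-in Kubernetes "edit" role
  apiGroup: rbac.authorization.k8s.io
```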
If your team gets a namespace as its only playground, then you're probably living in a multi-tenant world, and this tends to be the case for the long tail of application teams deploying to Kubernetes environments. Now, regardless of which approach you take, single-tenant or multi-tenant, the platform team still has a role to play, so let's explore where they fit into this equation as well.
Ops teams may have shared repos with platform teams, or they may have distinct repos. In the case on the left, the GitOps setup process is simple, but the organizational process may be more challenging because there's more coordination involved if two teams are sharing the same repo. On the right, you've kind of flipped that problem on its head: you've made the GitOps setup more complex with distinct repos, but the organizational setup is easier.
Multi-tenant approaches essentially look very similar. The only difference here is the scale and complexity of the repo management. In a shared repo approach, that management may have to be accomplished via things like PR reviews on protected branches. Or you can have distinct repos, where platform and ops teams are completely separated; this simplifies most of the day-to-day Git management, but it does make the GitOps configuration much, much more complicated with the distinct repo approach.
The GitOps tools that are out there today have different ways of supporting this, like Argo CD's Application, or app of apps, model. So there are different approaches out there; that's one with Argo. With Config Sync there are root repo and separate shared repo options as well. But ultimately you're putting the complexity back onto the GitOps tooling, and you're simplifying the work on the organization.
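As one concrete sketch of that, here is roughly what an Argo CD app-of-apps parent Application could look like: it points at a repo path that contains further Application manifests owned by other teams, so the complexity lives in the tooling rather than in one shared repo (repo URL and paths are illustrative assumptions):

```yaml
# Parent "app of apps": Argo CD syncs this Application, and the manifests
# under apps/ are themselves Application objects pointing at team repos.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: team-apps
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/platform-config
    targetRevision: main
    path: apps/
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```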
So let's take a look at an example workflow where application, operations, and platform teams all have separate repos, but they are effectively able to collaborate without stepping on each other. It starts with the dev team writing code for their applications and building those artifacts, and then those artifacts are stored in some sort of artifact repository.
Now, if we build upon that example, we need to talk through some additional considerations as well. For starters, what if config and infrastructure, that is, the ops and the platform teams, had a shared repo between those two teams? Well, how do those teams need to work together? Is the repo owned by the platform team, or is it owned by the ops team? Are there weird permissions or protected branches that we have to worry about if it's owned by platform?
There's also the option that maybe there's an approval process at continuous delivery time, before objects get pushed to Kubernetes, and maybe that's done by the platform team. Maybe it only applies to prod and not to the other deployment environments like dev, QA, and staging, because we want to stay out of people's way as much as possible and let them work quickly.
These are all the sorts of things that need to be understood, and again, it's not one size fits all. Every organization is different in its own ways, and everyone views the division of responsibility and ownership in different ways. So instead of trying to figure this out with tools, which is not going to work, don't let the tools drive this process.
But if there's no clear indication of ownership or permission or responsibility, then you're left kind of wondering how we're going to fix and understand all of this. Now, versioning is the next topic. Versioning is relatively straightforward, but there are a couple of considerations to remember. This is by no means hard and fast guidance that's 100% correct, but for many organizations and their teams, a branch per non-production environment tends to work well and provides a pretty clear process and lineage for GitOps deployments.
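As a minimal sketch of the branch-per-environment model, here is what a staging environment might look like with Argo CD, where the Application simply tracks the head of the staging branch (names and repo URL are illustrative assumptions):

```yaml
# Staging environment tracks the head of the "staging" branch.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-service-staging
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/my-service-config
    targetRevision: staging        # branch per non-prod environment
    path: k8s/
  destination:
    server: https://kubernetes.default.svc
    namespace: my-service-staging
```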
Of course, the other GitOps tools out there like Flux and Config Sync support similar approaches; I just wanted to put up one example here. Now, for releasing to prod, we move away from the head of any particular branch to something a little bit safer. The safest approach is always to use a commit hash.
Instead of trying to update a whole bunch of GitOps controller CRDs with different commit hashes all the time, we could take a slightly different approach and use tags. Tags are the next best option after a commit hash. Their only downside is that they are not immutable, so we're back to good process and hygiene around Git being really, really important to keep this from becoming a problem and to keep it from getting abused.
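Continuing the same illustrative sketch, a prod Application would pin its revision rather than tracking a branch head; the commit hash and tag below are placeholders:

```yaml
# Prod pins an exact revision instead of a branch head.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-service-prod
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/my-service-config
    # Safest: pin to an exact commit hash...
    targetRevision: 3f9c2b7d1a4e5f6c8b9d0e1f2a3b4c5d6e7f8a9b
    # ...or, next best, pin to a release tag:
    # targetRevision: v1.4.2
    path: k8s/
  destination:
    server: https://kubernetes.default.svc
    namespace: my-service
```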
Now, regardless of whether you deploy to prod via commit hash or tag, you want to employ some good basic principles and practices. First and foremost, those CRs that specify repo hashes or tags should themselves be deployed in a declarative manner, not using any imperative approaches like the CLI or some other imperative approach. Then you'll want to build out some sort of distinct delivery process, outside of your application pipelines, to deliver these updated CRs that specify new branch names or new commit hashes or new tag names. And that deploy process should match what your organization wants, whether they want to do kind of a blue-green deployment and switch 100% of the traffic over, or they want to do a canary-style process where small percentages of traffic are shifted over to newer versions of the application. That's really, again, back to what your teams want as the outcome.
That means: is the automation going to check health checks and readiness checks before continuing to progress further into the deployment, or is there a human who makes that decision and says, we're going to deploy 20%, I'm going to check the numbers, then I'm going to deploy up to 50%, and so on and so forth? Either way, it should be written down and transparent to every application team, so they know how that deployment process to prod works.
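The talk doesn't prescribe a tool for that canary process, but as one illustration, here is a minimal sketch using Argo Rollouts (my choice for the example, not the speaker's; names, image, and step values are assumptions). An empty pause waits for a human to promote, while a pause with a duration lets the automation continue on its own:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-service
spec:
  replicas: 4
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
        - name: my-service
          image: registry.example.com/my-service:1.4.2
  strategy:
    canary:
      steps:
        - setWeight: 20            # shift 20% of traffic to the new version
        - pause: {}                # wait for a human to check the numbers and promote
        - setWeight: 50
        - pause: {duration: 10m}   # or let automation continue after a soak period
```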
Now, another way that dev, ops, and platform teams can collaborate is via an upstream dependency process. This is often done using things like Helm charts, so in this section I wanted to quickly mention another approach. We're not going to spend a ton of time on it because we're flying pretty quickly through all this, but I want to make sure y'all are aware of what other options are out there, especially ones that match the GitOps model much more closely, and that approach is called kpt.
We don't have time for a full-on kpt tutorial or walkthrough, but I'd recommend y'all take a look at kpt.dev. I like to think of kpt as basically another way to use package management semantics, but with bundles of Kubernetes config. That's it. Now, one example that often comes to mind when we talk about upstream dependencies is this idea of having approved software packages that can be used by application teams. So you could think of things like Redis or MongoDB.
Maybe the platform team or the security team has approved using Redis, but they've done it with very specific configuration details. They don't want anyone just grabbing the Redis image and deploying it on their own; they want a carefully controlled Redis artifact and Redis configuration that gets deployed.
So how do those platform teams then share that with their application teams or their ops teams? This package management approach is pretty helpful because, one, they can pull that package in for the bundled Redis configuration, but it also provides the opportunity for them to update it as well. As the platform team updates that configuration or revs the version of the Redis deployment, the application teams and the ops teams can pick up that update, again using regular old package-management-style semantics.
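As a rough sketch of what that looks like with kpt, a consuming team fetches the approved package with `kpt pkg get` and later picks up the platform team's changes with `kpt pkg update`; the package's Kptfile records which upstream repo and ref it came from. The repo URL, directory, and version below are illustrative assumptions:

```yaml
# Kptfile inside the team's local copy of the approved Redis package,
# recording the upstream it was fetched from and how updates are merged.
apiVersion: kpt.dev/v1
kind: Kptfile
metadata:
  name: redis
upstream:
  type: git
  git:
    repo: https://github.com/example-org/approved-packages
    directory: /redis
    ref: v1.2.0
  updateStrategy: resource-merge
```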
So that's why I like this approach, and it's worth looking into. Now, the last topic I want to talk about as it relates to teams and GitOps and cloud native is around guardrails and guarding against danger. I'll be using some key terms as we talk about this in the next section, so I want to quickly define them up front. First is policies. Policies are rules that tell us how we can configure a resource, pretty straightforward. When you're using Kubernetes, policies can specify things like what labels are allowed on a pod, or requiring images to have specific tags, that sort of thing. Now, policy management is the mechanism that helps us with the ins and outs of a policy. Think of this as the framework, the runtime, helping us manage or pull in external data, packaging, testing, that kind of stuff. And the last part is policy enforcement, and that really refers to the actions that will be taken and the scope of those actions.
The policy management aspect comes from Open Policy Agent. That's a really broad and popular framework for managing the policy bits, and the policy enforcement aspect comes from a sub-project of Open Policy Agent called Gatekeeper. Gatekeeper essentially packages up OPA, Open Policy Agent, and delivers it as a custom Kubernetes admission controller. So it's there to allow or deny admission to the cluster based on whether you violate a policy or not.
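To make the "policy" piece concrete, here is a minimal sketch of a Gatekeeper ConstraintTemplate along the lines of the well-known required-labels example; it defines a reusable policy (in Rego) that constraints can then apply with specific parameters. The template is a standard community example rather than something from the talk:

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        openAPIV3Schema:
          type: object
          properties:
            labels:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels

        # Violation if any required label is missing from the object.
        violation[{"msg": msg}] {
          provided := {l | input.review.object.metadata.labels[l]}
          required := {l | l := input.parameters.labels[_]}
          missing := required - provided
          count(missing) > 0
          msg := sprintf("missing required labels: %v", [missing])
        }
```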
With this new admission controller in the form of Gatekeeper, the Kubernetes API checks with Gatekeeper and says, this object wants to enter the cluster, how do we decide what to do with it? Gatekeeper gets that request and provides a yes response or a no response, and it's just that simple.
When the enforcement happens, Gatekeeper reviews the incoming object and compares it to all the policies that are there. It checks the namespace scope, the object type scope, and the policy rule itself, and whether that policy is just there for auditing purposes or whether it's there to deny entry altogether. Then it makes the decision and hands it back to the Kubernetes API, and the Kubernetes API then rejects admission or allows admission.
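A minimal sketch of that scoping, assuming the K8sRequiredLabels template above is installed: the Constraint below limits the rule to Deployments in the prod namespace and chooses whether violations are denied or only audited (names, namespace, and label are illustrative):

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: deployments-must-have-team-label
spec:
  enforcementAction: deny        # use "dryrun" to audit without blocking
  match:
    kinds:
      - apiGroups: ["apps"]
        kinds: ["Deployment"]    # object type scope
    namespaces: ["prod"]         # namespace scope
  parameters:
    labels: ["team"]             # the policy rule's parameters
```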
So that means when commits are pushed, we can have enforcement happen right there. As you push a commit, there's a test that gets kicked off, and that test comes back and says, hey, this is actually going to violate a production policy, you have to go fix this. I can't pass the build, whether it's your application or, you know, a Deployment object for Kubernetes, until you fix this, because it's going to violate a policy. And you can have that same approach work on a PR review as well.
When a PR comes in, the PR is automatically tested and says, okay, this object or this application is going to violate a production policy. You do that by having the infrastructure team or the platform team write those policies. Those policies are available for all teams to see and they're able to pull them in, so they get the latest and greatest every time they do a commit or do a PR.
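One way that CI check might look, as a hedged sketch: a GitHub Actions-style job that evaluates rendered manifests against the platform team's Gatekeeper templates and constraints with the gator CLI. The repo layout, the step details, and the assumption that gator is already available on the runner are all mine, not the speaker's:

```yaml
# Hypothetical CI workflow: fail the build if any manifest in the PR
# violates the platform team's Gatekeeper policies.
name: policy-check
on: [pull_request]
jobs:
  policy-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Assumes gator (the Gatekeeper CLI) is installed on the runner,
      # manifests live under manifests/, and the platform team's
      # ConstraintTemplates/Constraints have been pulled into policies/.
      - name: Evaluate manifests against policies
        run: gator test -f=manifests/ -f=policies/
```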
We should also have another policy evaluation or enforcement element at delivery time, just in case, to catch any last-minute things that might have bypassed an approach or come in a different way. And then, finally, we want to stick with the standard OPA Gatekeeper approach, which is to run it at the Kubernetes cluster, basically as a bouncer for the front door. So anybody that's going to violate policy through an imperative operation or some other API client also gets blocked right at the Kubernetes door as well.
Now, we covered a lot of ground, but there's one thing I want you to take away from all this. The most important takeaway through the whole thing is that this is not a one-size-fits-all approach. Doing GitOps for cloud native with teams is a human and a process problem. There is no one approach that works for every single team, and no one approach scales to every single organization.
Instead, you want to take a deeply collaborative approach and work with your teams early and often on documenting process and understanding. That documentation could cover things like what each team's role and responsibility is, and be specific, even if it's down to things like, hey, this team only writes Pod and Service manifests or Deployment manifests, great; this other team is only responsible for config, you know, ConfigMaps or Secrets, perfect. What we want to have is a clear idea for everybody on all these teams.