From YouTube: ArgoCon '21 - Argo-based service delivery for multi-tenant, multi-region clusters at Adobe (Aya Ivtsan)
Description
How do you enable CI/CD for hundreds of services, each with its own pipeline requirements and use-cases, all deploying to multiple clusters in multiple regions? At the Adobe platform team, we are using Argo Workflows, Events, CD, and Rollouts to create a flexible, customizable developer experience for CI/CD, where developers get a secure end-to-end pipeline out of the box and are able to tweak and modify it in a self-serve manner. Join us to learn how we enable and empower teams to manage their own Argo-based CI/CD pipelines across multiple multi-tenant clusters.
Great to be here. All right, hi everyone, good morning or good evening, depending on where you are. Very excited to speak here at the very first ArgoCon. Congrats to the Argo developers and to the organizers, and thank you for bringing us all together. My name is Aya. I've been working at Adobe for the past 15 years, and I've been here through Adobe's transition from the desktop to the cloud. I'm working on our cloud foundation platform, called Ethos.
These product services are using several platforms, like the content platform, the data platform, and the Sensei machine learning platform, and all services, both product services and platform services, use Ethos, which is our cloud foundation platform. Ethos itself has clusters running on AWS, Azure, and data centers.
There are around 26,000 tenant namespaces on those clusters, and in those there are over 150,000 workloads running at any given time. We're averaging around 3 million containers per 24 hours, and we're doing around 800 builds a day and around 200 deployments a day through our CI/CD solution.
Let's take a closer look at Ethos. Ethos delivers containerized applications to the cloud, and it does so by incorporating the twelve-factor principles.
The platform is cloud agnostic, and the main purpose of the platform, in addition to the management of Kubernetes clusters and everything that entails, is mostly to handle cross-cutting concerns. So, for example, we have blessed base images that developers base their own images off of.
We have out-of-the-box code generation and service bootstrapping for Java services, which comprise around 70 percent of all Adobe services.
We offer a framework and libraries to do things like easily connect to other Adobe services, logging, validations, and things like that. We also offer Kubernetes namespace provisioning with bootstrapping of things like secure network policies and quota limitations, and we have a CI/CD solution, with more on this coming up.
So now we have two offerings for Adobe developers: the paved path and the do-it-yourself, which we can also think of as the wild-west option. With the paved path, developers get full abstraction of Kubernetes, fully automated provisioning, and opinionated CI/CD with guardrails out of the box, but very limited flexibility.
We'll talk about what this means in the next slides. With the do-it-yourself option, developers get maximum flexibility, but they have to bring their own CI/CD. They get only namespace provisioning, and there's no abstraction, meaning developers have to have a very good knowledge and understanding of Kubernetes.
A bit more about our paved path, or 1.0, CI/CD solution, which has been with us since the Mesos days. It implements the evergreen main branch philosophy, meaning that changes get merged to the main branch only after they are validated in production, as opposed to GitOps, as you all know, where changes get merged to the main branch as a trigger for the automated deployment. Using the evergreen main was key for us at the time as a guardrail to encourage Adobe developers to do frequent deployments, while avoiding things like git reverts, risks, and messy situations. It also allowed them to do easy and safe rollbacks. Other guardrails we have include enforcing code reviews, sequential environment promotion, image scanning, and more, and we have the abstraction that I mentioned over the underlying technology. It's basically a YAML spec where you describe your service configuration needs. So that's our current 1.0 CI/CD solution.
Some examples of the unsupported features include cascade deployments, multiple pipelines per service, and more. And because we are using the full abstraction, without a way for tenants to override it, any new feature needs to be made available through that abstraction in our multiple CI/CD tools, even if it's already available out of the box on the Kubernetes cluster. This is causing slow progress for adding new features, and is forcing more and more developers to turn to the DIY offering, and the DIY offering is gaining more and more teams.
We are using this hub cluster for running Argo Workflows and Events in tenant-specific namespaces, and the hub cluster also has Argo CD running in it, which is remote-connected to all the clusters in our fleet. On each of the remote clusters we have Argo Rollouts, so we are implementing this kind of hub-and-spoke model for the clusters.
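The hub-and-spoke model described here can be sketched with an Argo CD Application on the hub whose destination points at a remote (spoke) cluster's API server. All names and URLs below are hypothetical; the remote cluster would first be registered with Argo CD (e.g. `argocd cluster add <context>`):

```yaml
# Hypothetical Argo CD Application on the hub cluster that deploys a
# tenant's chart to a remote (spoke) cluster.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-service-prod-us-east
  namespace: argocd
spec:
  project: my-tenant
  source:
    repoURL: https://github.example.com/my-tenant/my-service-deploy.git
    targetRevision: main
    path: charts/my-service
    helm:
      valueFiles:
        - values.yaml
        - values-prod.yaml
        - values-prod-us-east.yaml
  destination:
    server: https://remote-cluster-us-east.example.com   # spoke cluster API
    namespace: my-tenant-ns
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```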
So a developer makes a code change to the app code. This in turn triggers an Argo Workflow, through Argo Events. The workflow has steps to first build the image, scan it, and push it to an image repository. Next, there's a promotion to the environment, and that is done by updating the image version in the desired namespace location in GitHub, which in turn triggers the Argo CD sync.
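The build-scan-push-promote sequence could be sketched as an Argo Workflow whose steps reference shared templates. This is a minimal illustration; the step and template names are hypothetical, not Adobe's actual internal templates:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: my-service-ci-
  namespace: my-tenant-ns
spec:
  entrypoint: pipeline
  templates:
    - name: pipeline
      steps:
        - - name: build-image      # build and tag the container image
            templateRef: {name: build, template: image-build, clusterScope: true}
        - - name: scan-image       # security-scan the built image
            templateRef: {name: scan, template: image-scan, clusterScope: true}
        - - name: push-image       # push to the image repository
            templateRef: {name: push, template: registry-push, clusterScope: true}
        - - name: promote-env1     # bump the image tag in Git; Argo CD syncs it
            templateRef: {name: promote, template: git-bump-image, clusterScope: true}
```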
These steps, from the promote-to-environment-1 step until notify, can be repeated for each environment and region, sequentially or in parallel, on the remote namespace. If Argo Rollouts objects are defined, which they are by default, the deployment will be going through a progressive delivery strategy such as canary, or canary with analysis, which can be defined per service.
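A per-service "canary with analysis" strategy could look like the following Argo Rollouts manifest. This is a sketch; the image and the AnalysisTemplate name are hypothetical:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-service
spec:
  replicas: 5
  selector:
    matchLabels: {app: my-service}
  template:
    metadata:
      labels: {app: my-service}
    spec:
      containers:
        - name: my-service
          image: registry.example.com/my-tenant/my-service:1.2.3
  strategy:
    canary:
      steps:
        - setWeight: 20          # send 20% of traffic to the canary
        - pause: {duration: 5m}  # bake time before continuing
        - analysis:              # optional "canary with analysis"
            templates:
              - templateName: success-rate   # hypothetical AnalysisTemplate
        - setWeight: 50
        - pause: {duration: 5m}
```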
We implemented a wait-for-deployment step. We implemented an active wait as a starting point, since at the time when we started with this, Argo CD Notifications was not yet at a 1.0 version, but we do plan to use Argo CD Notifications with the workflow suspend-and-resume pattern. We're also investing in training: we're adding self-paced training and documentation to help service developers with the Adobe-specific parts of the solution, while directing them to the public docs for the Argo stuff.
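The suspend-and-resume pattern mentioned here could be sketched as a `suspend` step that an external notification later resumes. This is a sketch of the pattern, not Adobe's implementation:

```yaml
# Workflow fragment: park at a suspend step instead of actively polling.
# An Argo CD notification trigger could then run, for example:
#   argo resume my-service-ci-abcde
# once the Application reports a healthy sync.
  templates:
    - name: deploy-and-wait
      steps:
        - - name: promote          # bump image tag in Git; Argo CD picks it up
            templateRef: {name: promote, template: git-bump-image, clusterScope: true}
        - - name: wait-for-sync
            template: wait-for-sync
    - name: wait-for-sync
      suspend: {}                  # resumed via `argo resume` or the Workflows API
```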
A bit more about the default workflows with guardrails. Those are generated out of the box for developers. As I mentioned, all common steps reference cluster workflow templates, and all manifests are helmified and synced to the hub cluster with Argo CD. Default workflows are configured through values files, with sequential environment promotion and parallel deployment to regions by default, and the default workflows include things like image scanning, validation steps, and placeholders for integration test steps.
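A values file driving such a generated workflow might look like this. The schema below is a hypothetical sketch of the idea, not Adobe's actual format: environments promote sequentially, and the regions within an environment deploy in parallel:

```yaml
# Hypothetical pipeline values for a generated default workflow.
pipeline:
  environments:                # promoted sequentially, in order
    - name: dev
      regions: [us-east-1]
    - name: stage
      regions: [us-east-1, eu-west-1]
    - name: prod
      regions: [us-east-1, eu-west-1, ap-south-1]   # deployed in parallel
  guardrails:
    imageScan: true            # scan the image before any promotion
    validation: true
    integrationTests: []       # placeholder steps tenants can fill in
```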
All steps or workflows that can be reused across services are implemented by us, the platform team, or contributed by Adobe developer teams, as cluster workflow templates, and service workflows reference them, like in this example here. I'll talk more about open development contribution within Adobe a little later.
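Referencing a shared step from a service workflow uses a `templateRef` with `clusterScope: true`. A minimal sketch, with hypothetical template names and images:

```yaml
# A reusable step published by the platform team (or contributed by a tenant)...
apiVersion: argoproj.io/v1alpha1
kind: ClusterWorkflowTemplate
metadata:
  name: image-scan
spec:
  templates:
    - name: scan
      inputs:
        parameters:
          - name: image
      container:
        image: scanner.example.com/scan-tool:latest   # hypothetical scanner
        args: ["image", "{{inputs.parameters.image}}"]
---
# ...and a service workflow that references it:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: my-service-scan-
spec:
  entrypoint: main
  templates:
    - name: main
      steps:
        - - name: scan
            templateRef:
              name: image-scan
              template: scan
              clusterScope: true     # look up a ClusterWorkflowTemplate
            arguments:
              parameters:
                - name: image
                  value: registry.example.com/my-tenant/my-service:1.2.3
```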
The Helm charts that define a service or application's namespaces are managed centrally in a Helm repository by us. They're referenced from developer repos as Helm dependencies, and specific features such as scaling parameters, ingress, environment variables, etc., are all controlled through the values files. The values files have a hierarchy for environments and regions as part of the default workflow.
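A tenant repo pulling in the centrally managed chart as a Helm dependency might look like this; the chart name and repository URL are hypothetical:

```yaml
# Chart.yaml in the developer's repo: the platform chart is a dependency.
apiVersion: v2
name: my-service
version: 0.1.0
dependencies:
  - name: ethos-service        # hypothetical shared chart from the platform team
    version: ">=1.0.0"
    repository: https://charts.example.com/platform
```

Environment- and region-specific overrides would then live in layered values files (e.g. `values.yaml`, `values-prod.yaml`, `values-prod-us-east.yaml`), with the most specific file applied last.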
On the left, we have a git repo for the shared templates, which is maintained by us, the platform team. That repo has a Helm chart for the shared templates and two desired states, one for the stage cluster and one for the prod cluster. Those are synced to the stage and prod hub clusters respectively, with Argo CD.
And finally, it promotes the changes, or the new Helm chart, to the prod cluster, again through Git and Argo CD, and that's when the latest changes become available to all the tenant workflows.
So our journey to an Argo-based CI/CD solution is currently on the way. Some of the things we're planning to tackle next: we're productizing the hub clusters and their services; we're enabling more self-service, like adding automation for failure recovery, and automation for provisioning, both for the initial provisioning and for ongoing changes such as adding an environment or a region; and we're thinking of a unified UI, where developers will be able to see all the pipelines and namespaces for their service.
Some takeaways: have a modular CI/CD. Give developers a secure paved path out of the box, but make sure to enable a way to customize and override the paved path, in order to enable fast progress. Make it easy and safe to contribute to the platform, as open development. Leverage the industry: build just enough on top of it, and contribute back. Invest in Kubernetes-native tools and leverage the rich ecosystem of Kubernetes. Abstraction is important, but it must be possible to override it.
So Argo is a great fit for our requirements. We're building the missing pieces we need on top of it to make our solution work, including the customizable abstraction, which we are planning to work on, and we're looking forward to working with all of you in the Argo community to make it better for all of us.
So I want to ask you just some of the questions that are top of mind, and then invite you to join the chat so that people can directly ask you the questions that they have. But one of the ones that people are asking is about the secrets that you use for deployments, like sealed secrets. What kind of context can you give us about secrets in this particular application?
Yeah, so we are using Vault at Adobe, and that applies both to the remote clusters, so for runtime secrets, like secret environment variables.
You know, if you have a custom domain name that you need a custom certificate for, you will need the secret for the certificate, the secret for the certificate key, etc. So all those runtime secrets require secrets on the remote clusters where the service is running, and we have the vault operator for that.
So the vault operator is a service that is running in the tenant namespace, and it is controlled through CRDs that are called vault secrets. Whenever vault secrets get created in the namespace, the vault operator translates that into a Kubernetes secret. So basically the Helm chart defines the secrets in terms of Vault paths and field names, and then those get translated by the vault operator into an actual Kubernetes secret. And we're doing a similar thing on the hub cluster.
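The flow described here could be sketched like this; the CRD group, kind, and fields below are all hypothetical, since the actual operator and its schema are Adobe-internal:

```yaml
# Hypothetical vault-secret custom resource in the tenant namespace:
# the operator reads the listed Vault paths/fields and materializes
# a regular Kubernetes Secret from them.
apiVersion: vault.example.com/v1
kind: VaultSecret
metadata:
  name: my-service-tls
  namespace: my-tenant-ns
spec:
  targetSecretName: my-service-tls   # Kubernetes Secret to create
  fields:
    - key: tls.crt
      vaultPath: secret/data/my-tenant/certs
      vaultField: certificate
    - key: tls.key
      vaultPath: secret/data/my-tenant/certs
      vaultField: private_key
```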
Then there are things like build-time secrets, where, you know, you need a secret to push the image to your image repository. One other thing which is different between our original CI/CD and the new one is that, with the new CI/CD, we want all the secrets to be provided by tenants. In other words, we don't want to share any secrets across tenants, which is something we had to do in our previous solution because of some technical limitations.
That means we also need secrets on the hub cluster, for things like the build step or the image push step, and again, we are using the vault operator for that. And what we plan to do, or what we are actually working on actively in this sprint, is to add the vault operator into our default Helm chart, so it will be very easy for developers to incorporate the vault operator in both their hub namespace and their remote namespace.
I think so, I think so. And I know we're right at time, so I wanted to give you a shout-out, since you haven't read the chat yet. It was from Thomas Sebastian, and he said that this is the best talk in a long time, out of a lot of conferences. So the community is loving this. Feel free to jump in there and have a chat with them. And those of you that are still listening, let's get back in the chat and keep this conversation going, and we'll see you on the next one.