From YouTube: Argo: Real Enterprise-scale with Kubernetes
A
Okay, I'm Libby Schultz; I'll be moderating today's webinar. We would like to welcome our presenters today: Al Kimner, principal software engineer and architect at New Relic; Daniel Jimble, staff engineer at New Relic; and Caleb Trotton, product manager, Telemetry Data Platform at New Relic. A few housekeeping items before we get started: during the webinar you are not able to talk as an attendee. There is a Q&A box at the bottom of your screen. Please feel free to drop your questions in there and we'll get to as many as we can at the end. This is an official webinar of the CNCF, and as such it is subject to the CNCF code of conduct. Please do not add anything to the chat or questions that would be in violation of that code of conduct, and be respectful of all of your fellow participants and presenters.
B
All right, hello and good morning, good afternoon, or good evening, everyone. Welcome. Before we get started with the presentation, I want to have a word from our legal team, so this is our safe harbor slide. We can move on; next slide, please. I'm Al Kimner, principal engineer and architect. I help engineering teams build software and systems that are simple to maintain and scale. My favorite hobby is scuba diving.
B
We have a packed agenda for our presentation today, with two compelling demos. I'm going to give you an overview of New Relic's ingestion, streaming, and storage architecture, which should set the stage for what problems we have and how Argo fits into that. I'll cover our use of Argo CD and the scale at which we're using it. Caleb will walk us through how Argo Rollouts gives us a better experience than a Kubernetes rolling update for a deployment, showing a demo of a deployment with an automated canary analysis.
B
After that, we'll cover additional needs we have with orchestration at scale and how Argo Workflows helps us there. Daniel is going to cover how we use Terraform and Open Policy Agent with an Argo Workflow to safely roll out infrastructure-as-code changes. Our main objective today is that you'll be able to understand how to safely implement continuous delivery at scale for both Kubernetes resources and infrastructure-as-code pipelines.
B
This is where Argo CD enters the picture. First, some history about Argo. Argo was created in 2017 at Applatix, which was acquired by Intuit in January of 2018, who open sourced Argo a few months later. BlackRock contributed Argo Events to the Argo project. Argo then joined the CNCF in April of 2020.
B
So why Argo? Well, at New Relic we are constantly evolving our systems, along with our internal engineering processes and operations. One of those changes was introducing Kubernetes. Kubernetes was a good fit for us because we have been multi-cloud for many years and our services exist across multiple public cloud providers and private data centers.
B
This is a long list of features that made it compelling for us to pick Argo CD for our continuous delivery needs, and this is not even the full list; I just ran out of room on the slide. One of the main drivers was for us to have the ability to easily manage and deploy to multiple Kubernetes clusters with a GitOps workflow.
B
Working at a company that focuses on observability, it would not feel right without sharing some stats about our current Argo CD instance. We are at approximately 3,000 applications and over 10,000 Kubernetes deployments in the last month. The Kubernetes clusters are very big, with most over a thousand nodes. We've segmented our services into different workloads, and workloads are assigned to different-size Kubernetes clusters. I'm pointing all this out because we have lots of variables, with dozens of internal engineering teams, a whole bunch of services, and lots of changes.
C
Thank you, Al. So I'm going to talk to everybody today about Argo Rollouts and how that helps us with the safety of so many deployments happening every day.
C
So with hundreds, if not thousands, of deployments a day, we need a way to make sure that changes roll out safely and don't require an extreme amount of effort from engineers to make sure that the deployment works right. For this we like to use a canary deploy strategy. If you're not familiar with the canary deploy strategy, it involves rolling out a change to a small subset of instances first, verifying that it's healthy, and then rolling it out to the rest.
C
This can totally be done manually, in terms of verifying whether the canary is safe, but we don't want to have a human involved for 30 or 60 minutes every time a deploy is made when we're making hundreds of them a day. So something that we were really looking for was automated canary analysis.
C
So we looked at Argo Rollouts for this, because the standard Kubernetes Deployment resource doesn't provide most of this stuff that you see here. Its rolling update strategy allows you to roll things out slowly, one at a time, as long as the probe conditions are met, but not really that advanced use case of stopping, pausing, and running analysis in more granular steps.
C
So Argo Rollouts does provide this stuff for us, and I'm going to walk you through now some of the pieces of Argo Rollouts and why it was compelling.
C
So let's talk about experiments first. An experiment, at its core, creates two different ReplicaSets.
C
Each
of
those
replica
sets
has
its
own
pod
spec
template,
so
you
can
deploy
something
with
as
small
of
a
change
as
a
different
version
in
your
docker
image,
or
you
could
have
two
wildly
different
pod
specs,
it's
up
to
you.
C
What you do with those pods with just an experiment is up to you; they'll be run and you can go poke them however you see fit. But where it gets a little more interesting is when you pair an experiment with an analysis template to be run against those pods. An analysis template describes the metric providers.
C
From there, the experiment will use the analysis template and initiate an analysis run. An analysis run is really just an instance of that analysis template with arguments filled in, typically with information about one or more of those pods in your stable or canary ReplicaSet.
C
And finally, what most folks actually interact with in Argo Rollouts is the Rollout resource. Like I said, this is a drop-in replacement for Deployment. If you didn't use any of Argo Rollouts' advanced features and didn't specify a canary strategy, you could just use it like a drop-in Deployment replacement, but you're not going to get all the goodies in it until you specify a special strategy like canary or blue-green deployments.
C
So on the right here we have a really basic example, mostly taken from the Argo Rollouts docs, showing that most of the spec is just like a Deployment. However, what we see here is an alternate strategy with canary. In this example, what we're doing is deploying 20% of the instances first, pausing for five minutes, and then running a one-time analysis using the success-rate analysis template and passing in an argument with the service name of the service being deployed.
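The canary steps described here could be sketched like this (the name and image are placeholders for illustration, following the shape of the examples in the Argo Rollouts docs):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: webinar-demo-app          # name assumed for illustration
spec:
  replicas: 5
  strategy:
    canary:
      steps:
      - setWeight: 20             # shift 20% of pods to the new version
      - pause: {duration: 5m}     # bake time before analysis
      - analysis:                 # one-time analysis after the pause
          templates:
          - templateName: success-rate
          args:
          - name: service-name
            value: webinar-demo-app
  selector:
    matchLabels:
      app: webinar-demo-app
  template:
    metadata:
      labels:
        app: webinar-demo-app
    spec:
      containers:
      - name: app
        image: example/webinar-demo-app:1.0   # placeholder image
```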
C
The analysis type that's used here, running a one-time analysis at the end of that pause duration, is one way that Argo Rollouts lets you configure analysis to run, but there are other ways, including running analysis in the background the entire time that your canary steps are progressing.
C
Last, before I get to the demo, I want to talk about all the different metric providers that you can specify in an analysis template. So first, you can run a Kubernetes Job. This would instantiate a Kubernetes Job on the cluster and just look for the exit status of that job. If it exits zero, your job is successful.
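As a sketch of the Job provider just described (the image and command are assumptions, not from the talk), an AnalysisTemplate metric can wrap any container whose exit code decides success:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: smoke-test
spec:
  metrics:
  - name: smoke-test
    provider:
      job:
        spec:
          backoffLimit: 0
          template:
            spec:
              restartPolicy: Never
              containers:
              - name: check
                image: curlimages/curl    # any image works; exit 0 means pass
                command: [sh, -c, "curl -fsS http://webinar-demo-app/healthz"]
```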
C
Also, you can directly query different metric providers: Prometheus, if you have PromQL queries that you want to run, as well as a number of commercial providers, which includes ourselves, New Relic, plus Datadog and Wavefront. The demo I'm about to show you is going to be using New Relic as the metric provider, because that's what we use here.
C
So let me pop out of the presentation and start showing you some stuff. I'm going to start first with a Rollout resource. What I'm going to demo is: I have an Argo CD application with two resources in it, a Rollout and an analysis template. This is the Rollout; you can see we have five replicas, and then we're going to have this canary deploy strategy where we deploy 20% of those resources, aka one instance, and we're going to pause for only 20 seconds.
C
This
is
a
demo
on
a
webinar
and
I
want
to
keep
it
a
bit
snappy
and
then
we're
going
to
run
a
one-time
analysis
against
the
error
rate
of
the
application,
we're
going
to
be
passing
in
a
an
application
name,
which
is
webinar
demo
app,
the
canary
hash.
This
is
a
something
provided
by
argo
rollouts.
C
It is the ReplicaSet identifier segment of the pod's name, and the "latest" value here basically says: give me the pod template hash from the canary group. And this is just another argument into our analysis template, saying how long to run our query for in our metric provider, so against New Relic.
C
The rest of this spec looks exactly like a typical Deployment spec. We have an image; we have environment variables. Most of these environment variables are just hooking up the New Relic agent, and then we have one environment variable that takes the Rollout pod template hash and, using the Kubernetes downward API, makes that available as an environment variable in our container. We are then using that to add to the instrumentation by the agent, so that we can pick up in our transactions whether a given pod belongs to the canary group or to the stable group.
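The downward-API wiring described here can be sketched as follows: Argo Rollouts labels each pod with `rollouts-pod-template-hash`, and that label can be surfaced to the container (the environment variable name is an assumption):

```yaml
env:
- name: POD_TEMPLATE_HASH        # name assumed; read by the app's instrumentation
  valueFrom:
    fieldRef:
      fieldPath: metadata.labels['rollouts-pod-template-hash']
```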
C
This error-rate analysis template that we're referencing here takes four arguments. It takes the application name and the canary hash; we saw those in our Rollout. We have this "since" that is defaulting to one minute. So for any rollout using this analysis template, for any of these arguments that have a value there, that value is the default, and you don't have to pass it in. For the arguments that don't have a default, you're required to pass in a value from your Rollout; we're passing in 20 seconds as the "since" here. The error threshold we didn't pass in; we're using the default, which is 1.0, a one percent error rate threshold. We're specifying a failure condition, which is that the error rate is greater than or equal to the threshold.
C
So this will fail if the error rate goes above one percent, and then the query that we're giving to New Relic, the shorthand of it, is: what is the error rate of this application with this pod template hash?
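Putting those pieces together, an error-rate AnalysisTemplate along these lines would match the description (the argument names and the NRQL query are a sketch, not the exact manifest from the demo; the `newRelic` provider also needs credentials configured in a secret on the cluster):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate
spec:
  args:
  - name: application-name
  - name: canary-hash
  - name: since
    value: "1 minute"            # default; callers may override
  - name: error-threshold
    value: "1.0"                 # default: 1% error rate
  metrics:
  - name: error-rate
    failureCondition: result.errorRate >= {{args.error-threshold}}
    provider:
      newRelic:
        query: >
          FROM Transaction
          SELECT percentage(count(*), WHERE error IS true) AS errorRate
          WHERE appName = '{{args.application-name}}'
          AND podTemplateHash = '{{args.canary-hash}}'
          SINCE {{args.since}} AGO
```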
C
So let me jump over here to Argo CD for a second, to this application. I have this Rollout running already. I am going to make some changes to it now, and I would caveat that I would typically do this in a more GitOps fashion.
C
But again, this is a demo and I want to keep it snappy, so we're just going to edit the manifest directly to simulate a new deployment. I'm going to deploy release 4 and we're going to see what Argo Rollouts does with this. You'll see, first, that it spun up a new ReplicaSet with one pod and scaled down the stable ReplicaSet to four pods, so we still have five pods total running.
C
In a second, what you're going to see (here we go) is that an analysis run was executed, and very quickly we see that the canary ReplicaSet has been scaled up to five pods; the old ReplicaSet is scaling down to zero.
C
So let's go the other way: let's deploy something bad. I have a version of this application that boots up completely fine, but there's a bug in it that causes all of the background processing that it does to error. So let's deploy that version.
C
I'm going to show you something while we're waiting here. This is New Relic, and I have a query here that is showing basically the same thing, the error rate for this application, and we're already seeing that this has just recently spiked up.
C
If I jump back to Argo CD, you're going to see that another analysis run occurred, and the newer ReplicaSet, rev6, scaled down because the analysis run failed. So it automatically rolled back, and the previous stable ReplicaSet scaled back up to its full five instances.
C
If you look at the analysis run, we get some events: the analysis failed, and specifically the error-rate metric failed. And if we look at some of the data behind the scenes, we have a full 100% error rate; of course we want to roll it back.
C
This kind of automated metric analysis is really important to us with the scale and the sheer number of deployments that we're doing in a day, again hundreds, if not thousands. That's not the kind of time and attention that we want to force engineers to pay. This is really allowing us to continue to move fast while making changes in a safe manner.
B
Cool, thanks, Caleb, that's a pretty compelling demo. I hope, everyone, this shows you exactly how to implement safe continuous delivery using Argo Rollouts, and how we do it too.
B
A common approach is to scale out deployments to other regions, but this doesn't work if your applications are sensitive to latency and you need to be close to where your customers are. It might seem straightforward to just keep scaling out your Kubernetes clusters by adding more nodes, but that's actually only a good practice up to a point. Our clusters are already thousands of nodes, so it seems like we need another mechanism.
B
This allows us to continuously deploy changes to a small subset of our cells without impacting all cells. Incorporating a cell architecture into the automated canary analysis deployment that Caleb showed means we can have really high confidence that our changes are not causing an issue. Not only are we doing canary analysis inside of an application deployment; it's also now inside of a cell that's isolated from our whole environment.
B
So the telemetry data platform looks like this inside of a region, where we can just keep adding cells as we need more capacity, in isolation, and we end up with N number of cells. This architecture can be applied to any number of applications; you just have to look at how you route and shard your data set.
B
The amount of orchestration at this scale requires us to have a flexible orchestration system that many teams can interact with, and that's where Argo Workflows comes in. Argo Workflows is perfect for this. In the systems we just talked about, moving to a cell architecture, we have approximately 20 teams involved.
D
I'm going to talk about how we have implemented our Terraform pipeline using Argo Workflows. When we started with our proof of concept to integrate our existing Terraform code into Argo, we had some requirements to accomplish. We had to use Argo Workflows to run Terraform, and every step had to be idempotent; that way it can be run multiple times if needed without affecting the current infrastructure.
D
Argo Workflows runs Docker images under the hood, so we had to create a new image to fit our needs, using Terraform, tfenv, OPA, and Conftest. Also, the image takes different inputs, and it doesn't require more interaction than plain Terraform.
D
We already had some existing Terraform code, where we were previously creating our infrastructure with another pipeline, but we had to make a few changes to make it even better. We had to switch to Terraform workspaces, so we don't have to duplicate the code for every cell when the code is always the same and only the variables change; and if we need to override some values for a specific cell, we can specify them in a .tfvars file.
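The workspace-per-cell flow described here amounts to something like the following (the cell name and file name are illustrative, not from the talk):

```shell
# One shared code base, one Terraform workspace per cell;
# per-cell overrides live in a cell-specific tfvars file.
terraform workspace select cell-01 || terraform workspace new cell-01
terraform plan -var-file="cell-01.tfvars" -out=plan.out
terraform apply plan.out
```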
D
The Open Policy Agent covers the need of making sure that the Terraform code that we're applying won't do any undesired changes, such as deleting a Kubernetes cluster, for example. It uses the Rego query language, where teams can specify their acceptance policies, and if the Terraform plan doesn't pass the OPA policies, the process is cancelled and it exits with an error. I'm going to show a quick demo.
D
So this is the Workflows interface, and here I have some Terraform code. What I'm going to try to show is that we are going to simulate that we are creating a Kubernetes cluster.
D
Then we are deploying some apps into that cluster, and, independently, we are creating an S3 bucket. If we go to the main.tf, I'm just using a null resource, because I'm not going to spend time creating a new cluster right now. But the interesting part here is this terraform.rego file, which is the one that allows us to make sure that we are applying things in a safe way.
D
For example, we have this resource scoring, where we are setting different weights. If the plan is going to delete a resource, it's going to add a hundred points; if it creates a new one, it's going to add ten points; and if it modifies one, it's going to add one point. We see that our blast radius is 30, so if the overall calculation of all the resources is over 50, it's going to be cancelled.
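As a rough illustration of that scoring (the real implementation is a Rego policy evaluated by OPA/Conftest against the JSON Terraform plan; the weights and the 50-point threshold here are the ones mentioned in the talk):

```python
# Minimal sketch of the "blast radius" scoring described above. The real
# check is a Rego policy run against `terraform show -json plan.out`.
WEIGHTS = {"delete": 100, "create": 10, "update": 1}
THRESHOLD = 50

def blast_radius(actions):
    """Sum the weight of each planned action ('delete', 'create', 'update')."""
    return sum(WEIGHTS.get(action, 0) for action in actions)

def plan_allowed(actions):
    """A plan passes only if its total score stays at or under the threshold."""
    return blast_radius(actions) <= THRESHOLD

# Creating a cluster, its apps, and an S3 bucket scores 3 * 10 = 30: allowed.
# Renaming the cluster plans a delete plus a create, 100 + 10 = 110: rejected.
```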
D
Here we have some secrets that are already on the Kubernetes cluster that allow us to clone the repository, and then here is where we launch the Docker image, and we are passing the values to the different environment variables that we need. For example, we need to pass the directory where the code is, and the Terraform version if we want to force one; otherwise it will detect the Terraform version written in the code and will download it automatically if it isn't already present in the local container. Then we specify the workspace, which basically is the name of the cell in our case, and where the OPA file is (we can use Conftest as well), and then the action that we're going to apply, which can be plan or apply. Then we have our AWS access keys, which again are stored on the cluster and allow us to communicate with AWS.
D
Anyway, this workflow here is simulating the one that I showed previously, the huge one. Basically it's a DAG, and I use tasks; I have three tasks here. One task creates the Kubernetes cluster, and it refers to the workflow template that I showed previously.
D
We are only going to pass different values: the workspace, in our case the cell name; the path that we want to apply, in our case the directory where our OPA file is; and then the action, which is apply. Then the second step is our deploy-apps. In that step, we can see that it has a dependency, so that task will be launched after the previous one is finished, and the only thing that changes is the path, and here we're using a Conftest policy instead of OPA. And then for the create-s3 task, we don't have any dependency, which means that at the beginning both the Kubernetes cluster and create-s3 are going to run in parallel. So let's go here.
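The DAG just described could be sketched as follows (the template and parameter names are assumptions; the real WorkflowTemplate was only shown on screen):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: provision-cell-
spec:
  entrypoint: main
  templates:
  - name: main
    dag:
      tasks:
      - name: create-k8s-cluster
        templateRef:
          name: terraform-runner        # shared WorkflowTemplate (name assumed)
          template: terraform
        arguments:
          parameters:
          - {name: workspace, value: "{{workflow.parameters.cell-name}}"}
          - {name: path, value: "clusters/"}
          - {name: action, value: apply}
      - name: deploy-apps
        dependencies: [create-k8s-cluster]   # runs only after the cluster exists
        templateRef:
          name: terraform-runner
          template: terraform
        arguments:
          parameters:
          - {name: path, value: "apps/"}
      - name: create-s3                      # no dependencies: runs in parallel
        templateRef:
          name: terraform-runner
          template: terraform
        arguments:
          parameters:
          - {name: path, value: "s3/"}
```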
D
So I'm going to submit the workflow. In that case it's just a Workflow, not a WorkflowTemplate, and I'm going to pass a parameter, which is the cell name.
D
I can check the logs from the console; we can see different colors, which in our case means that every color is a different step or a different container. Or we can go to the web UI, which is nicer.
D
So it's detected the Terraform version that we are using.
D
Then the workspace webinar-demo doesn't exist, so it creates the workspace automatically.
D
So now let's go to the code, and imagine that I say: oh, I'm going to change the resource name. If you are familiar with Terraform, you know that what this operation is basically going to do is destroy the previous cluster and create a new one. So: change the name, commit the changes.
D
Now it detected the workspace already, because it has been created before, and it says that it's going to destroy a cluster and it's going to create the new one that we have set. So the total score is 110, because it's 100 for deleting resources and 10 for adding one.
D
And then it says that it failed the OPA checks and it cancelled the operation, and as it cancels the operation, we can see that the deploy-apps step didn't run. And that's basically how we are using Argo Workflows with Terraform.
B
Cool, Daniel, that's a pretty sweet demo. It definitely showcases how we use Terraform and OPA with an Argo Workflow to safely roll out our infrastructure changes. This fits nicely with what Caleb showed: combined with Argo CD and Argo Rollouts, we can build new cells in a safe and automated fashion without any degradation of our service, at scale.
B
We had a question about CRDs. I think it was related to machine templates for k8s and how you manage the complexity of that. We do that with Argo Workflows: essentially, we'll have a step in a workflow that uses Argo CD and Argo Rollouts to push out the CRDs, and then we can run any validation that we want, and then you can have another step in the workflow that does whatever you need to do with machine templates, and those are treated as code as well.
B
The next question: how would the list of applications and the application details UI look with 200 microservices? Is it a drill-down UI, or is everything going to be on the same page? Caleb, you want to take this?
C
Yeah, I can take that, since we deal with having 3,000 applications in one Argo CD. Any individual application looks like what you saw in my demo, where it's scoped to just the resources that that application controls. The list of applications is one big list, but it is filterable, so you can group them together by project, which, at least in our case, we typically tie to a namespace or a team. And then you can also filter by which cluster you're targeting; you can put arbitrary labels on those applications and filter by that. Kind of a robust filtering system there. So it's not exactly drill-down, but it's like a filterable list.
C
I can take that too, because I think it probably depends on your DAST tool, but I can imagine using either a job metric provider in a canary release or a web metric provider, and, for example, running a Kubernetes Job that goes and asks your DAST tool to do a thing and then inspects the results.
B
Unfortunately not, but I think we will probably create a public blog post on New Relic and be able to share a lot of the content.
B
It kind of depends what the step in the workflow is doing. If you're using Rollouts, then you're in good shape, because it automatically rolls you back. For the Terraform stuff we showed, essentially, if it passes the OPA checks and then it fails for some reason, then you're going to get alerted and have to dive in to see what's wrong.
C
Are we running an Argo for each Kubernetes cluster, or do we have one that manages all of them? The short answer is we have one that manages all of them. The real answer is we have two that each manage all of them, but we definitely take the singular approach. The only one of those components that is deployed to every cluster is the Argo Rollouts component, because that is a controller that needs to run on every cluster; everything else is centrally managed in one.
B
I think the one from Alexi, probably: how do we create another environment for the application, as new instances of the application, which consists of hundreds of services and components and additional dependencies? For example, it might require creating a PV resource and may rely on centralized instances of, say, RabbitMQ. Can you copy an application to create new instances, i.e. copy dev01, test04, staging, etc., or do you create one from scratch every time?
C
Sure. This is why we do GitOps when creating a new version. Especially, I think, with the stuff that we're deploying, a lot of these are Helm charts, and Argo CD handles Helm and Kustomize and other renderers very well.
C
So you wouldn't necessarily be creating one from scratch, but you would be creating a new application with the exact same Helm chart, but with different parameters for your different environments, and that's how you deploy a new one. Yep, cool.
B
Yeah, so each workflow that a team owns has its own state backend in an S3 bucket.
B
That one I'm not sure of. I think you'll be able to look at the logs, but if you need to try to exec into the container, you might have to introduce your own pause mechanism there or something.
B
So this is a good question: does it compare versions of Kubernetes resources against what Argo created, for manual edits of things like envs, resource limits, requests, etc.? So yeah, it has a sync component to it. If you edit the state of a resource that is managed by Argo, like Argo CD or Argo Rollouts, if you edit that manually with kubectl, it'll essentially be out of sync with the Git repo it's coming from, and this is where GitOps comes in: if that happens, the next sync will essentially blow away those changes.
C
I can actually show this real quick, because I'm out of sync right now in that application that I demoed. Let me do this.
C
So, you know, I made some manual edits to that Rollout through this UI, and it now tells me that I'm out of sync with what's in Git, and you can get this diff here; let me show the compact version. You can get a diff about what has changed here versus what is in the Git repo backing this application. And somebody had asked about notifications earlier.
C
We specifically use the Argo CD Notifications project, which is in the Argoproj Labs org, and that will send out notifications, for example into Slack, when an application goes out of sync for any reason, which is really useful.
D
Yeah, that one asks: in your GitOps workflow, how do you manage credentials, like those required by Argo to access Git repositories? Are they stored in plain text or base64 in the repository itself? We're using Kubernetes Secrets to store all kinds of credentials, and then with Argo CD or Argo Workflows it's really easy to get them. It's just like when you define a pod that you want to use a secret; I think it's exactly the same definition.
B
Does Argo have LDAP and AD integration? Yeah, this was on one of my slides. It actually has SSO integration and very granular RBAC controls that will let you have multi-tenancy and things like that, and you could even deploy different Argo instances to the same Kubernetes cluster if you really needed to, in their own namespaces, and have some granular control of what teams can access and stuff like that.
C
We looked at some commercial providers we previously had, and so we looked at Harness; we looked at Tekton; we looked at a number of other pipelining tools in the same space as Argo Workflows, like GoCD, and that list is a little long. We previously had experience with Spinnaker, so we had done an evaluation of that.
It
all
kind
of
boiled
down
to
this
collection
of
tools,
each
solving
specific
needs
that
we
had
very
well
in
their
own
like
pinpointed
way
like
so
argo
cd,
you
know,
did
the
get
ops
and
the
syncing
and
didn't
try
to
do
anything
else.
Didn't
try
to
be
like
an
all-in-one
tool,
but
our
go
workflows
did
the
pipelining
and
it
did
that
really
well,
but
it
didn't.
C
You
know,
try
to
be
everything
and
we
found
that
flexibility
to
meet
all
of
our
needs
and
together,
you
know
like
putting
these
three
things
together
drew
a
really
nice
overall.
C
I can answer one of these: how do you set up and configure Argo CD? Do you use cluster bootstrapping features, app of apps? So yeah, for the team that manages the Argo installation, all of this stuff is managed through a Git repo that is synced through Argo CD. So there's one bootstrapping step to get Argo CD up and running the first time, and then it's just an automatic sync of all of the Argo components after that, specifically using the app-of-apps pattern.
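A root "app of apps" Application for that bootstrap might look like this sketch (the repo URL and paths are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: argo-components          # root app; its path contains child Application manifests
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/argo-config.git
    targetRevision: HEAD
    path: apps/
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true                # delete resources removed from Git
      selfHeal: true             # revert manual drift on the next sync
```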
B
So here's a question: do you think Argo could be utilized for setting up base apps of the Kubernetes clusters, like a PaaS layer of the applications that we use in our infrastructure: logging (Filebeat plus Logstash), ingress controllers, MetalLB for bare-metal clusters, storage class definitions for cloud providers, etc.? Or do you think Argo is good primarily for application state management?
B
I think it's both, really. This is where the GitOps workflow really comes in: the source of truth is really in Git for all this, and Argo is just the intermediary applying those changes, and that really lets us have confidence about what the changes were and who made the changes.
B
Somebody reviewed the changes; you can have different environments for testing those out; and then Argo is the delivery mechanism, but it's also doing stuff like Caleb showed with Rollouts, which is giving us the metric-based analysis for canaries and other things, and then with Workflows orchestrating it all together. So we can kind of do anything we want with it.
In
some
regard,
I
think
one
of
the
important
things
is
like
we
really
push
that
all
the
things
we're
we're
deploying
are
idepotent.
So
that
way
we
could
rerun
the
steps
over
and
over
again.
If
we
need
to
right,
like
it's
safe
to
just
say,
hey
they
run
and
they
will
not
make
any
changes
if
they
don't
need
to.
A
All right, I think that's all we have time for. Thanks, everyone, for joining us. I'll remind you that the slides and recording will be up later today on the CNCF website, and thanks again for joining us. Thank you all for your presentation and all your Q&A, and we will see you at a CNCF webinar very soon.