Argo ArgoCon 2021, 10 Dec 2021

From YouTube: ArgoCon '21: Argo CD and Stateful Applications - Tips and Tricks! (Christian Hernandez)

Description

Argo CD is widely used as a tool in GitOps workflows. Part of the appeal of GitOps is that recovery after a disaster can be expedited. Reinstalling the cluster and reapplying the manifests are the high-level instructions, but is that all? In this talk I will go over what to keep in mind when planning the recovery of your stateful applications.

A

We are now joined by Christian. Christian currently works on the customer and field engagement team in the Hybrid Platforms business unit within Red Hat. He's a technologist with experience in infrastructure engineering, systems administration, enterprise architecture, tech support, and management. Christian is passionate about open source and containerizing the world one application at a time. Lately he's been focusing on Kubernetes, microservices, cloud native architecture, and GitOps practices. Christian, what do you have in store for us today?

B

Hi, my name is Christian Hernandez. I am a senior principal technical marketing manager at Red Hat. I work on the OpenShift GitOps team, which uses Argo CD as its continuous deployment tool set for OpenShift, and today I'm going to be talking about Argo CD and stateful applications.

B

Stateful applications are always a challenge, and I'm here to share some of the tips and tricks that I've learned along the way working with stateful applications and Argo CD. Anytime I do a talk, I like to start off with a good quote, and I found a great one from Christian Posta, who is currently the field CTO at Solo.io and a former Red Hatter.

B

He gave a presentation at Red Hat Summit in 2017 about microservices, in which he explained that the hardest part of microservices is your data.

B

Although that talk was about sharing data between microservices, it did touch on an interesting subject: a challenge that is always there in cloud native computing and in your DevOps methodologies and practices, and that is your data. When I say your data, I'm specifically talking about stateful applications. So I like to borrow and amend Christian Posta's quote by saying that the hardest part of GitOps is your stateful applications.

B

One of the many benefits of GitOps is that it reduces the time to recovery, right? That's one of the things you get by doing GitOps practices. The idea is that you just reapply the manifests and you're back in business. You have an outage, you spin up a cluster, you reapply your manifests, and you're back up, right?

B

That's kind of the whole idea. But as we all know, and hopefully none of you have run into any serious problems, it's a little bit more complex than that. You always have to think: what about your stateful applications? Was the data lost? Were you hosting a database?

B

Do we need to recover data from a backup? What exactly was the application's storage pointing to? Where was it? What storage was it consuming? Depending on what happened, you may need to take a different approach to restoring it. Just as a point of reference, in this talk I'm going to focus specifically on recovering from an outage, that is, failing over to another site. Disaster recovery in general is kind of out of scope for this talk.

B

I just won't have time; it's a big topic. But hopefully this can give you ideas for preparing for those kinds of scenarios. The challenge of stateful applications isn't something new, as I said before, but it is a general problem, and hopefully you can start formulating some disaster recovery processes around what you learn today. First, I'd like to talk about storage with Kubernetes from a high level, from a generic standpoint.

B

Storage has been a challenge in Kubernetes from day one. Ideally, all of our applications would be stateless, but that's honestly not the world we live in. In the world of IT, whether or not you're doing cloud native, state (that is, data) is always at the center of everything, and it's no different with Kubernetes. There are a lot of differences between traditional application development and application development on Kubernetes, but what they have in common is the data, because that's really the center of the world.

B

Even with cloud native applications, you need to store state somewhere. This is where the persistent volume and persistent volume claim paradigm came into play: PV and PVC. Persistent volumes and persistent volume claims provided a way to abstract the storage from the application, the idea being that the application just says, "Hey, I need some storage," and everything else happens automatically on the back end.

B

Some of the things that came along after that were storage classes. Storage classes were there to help tier the storage: "I need ultra-fast storage," or "I don't care, as long as the data is available." Gold, platinum, that kind of idea.
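As an illustration of the "I need some storage" request and the tiering idea described above, a minimal sketch of a claim against a named storage class might look like this. The claim name `app-data`, the class name `gold`, and the size are assumptions for illustration, not values from the talk.

```yaml
# Hypothetical claim: the application asks for storage from a "gold" tier
# and leaves the details of where it comes from to the cluster.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data              # assumed name
spec:
  storageClassName: gold      # assumed tier name defined by an administrator
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi           # assumed size
```

The application only references the claim by name; which back end satisfies it is decided by the storage class.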

B

That was all very manual, but then dynamic volume provisioning came into play, where administrators no longer need to specifically create persistent volumes. There is a controller that creates those volumes for you: it connects to the back-end storage and creates the storage on the fly. It removed the need for administrators to create persistent volumes at scale. And now CSI, the Container Storage Interface, is making things easier.

B

So now you have a bunch of choices when choosing storage. Before, you traditionally just had block and NFS in Kubernetes; now all kinds of choices are provided for you. So, GitOps and storage. Dynamic storage provisioning introduced a very interesting problem: it was a savior, but then it introduced a problem with GitOps.

B

With GitOps, a representation of your application is stored in Git. This allows you to quickly restore in the event of a failure, like I mentioned before. So is it all unicorns and rainbows? Well, not really. If you think about it, the default reclaim policy for dynamic storage is Delete, meaning that when the persistent volume claim is deleted, the back-end storage is also deleted.

B

So forget losing a cluster: if someone accidentally deletes a PVC, the storage behind it automatically goes away. And even when you set a persistent volume's reclaim policy to Retain, restoring on the back end can still be very manual. Also, most storage management still lives outside of Kubernetes.
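To make the reclaim-policy point concrete: dynamically provisioned volumes inherit their reclaim policy from the storage class, and that policy defaults to `Delete` when the class does not set one. A sketch of a class that overrides it is shown below. The class name, the EBS CSI provisioner, and the `gp3` volume type are assumptions chosen for illustration; the talk does not show a storage class.

```yaml
# Hypothetical storage class whose dynamically provisioned volumes
# survive deletion of their claim.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: retained-gp3                  # assumed name
provisioner: ebs.csi.aws.com          # assumes the AWS EBS CSI driver is installed
reclaimPolicy: Retain                 # default is Delete when omitted
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp3                           # assumed EBS volume type
```

Even with `Retain`, a released volume is not rebound automatically, which is the manual restore work the speaker refers to.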

B

There are some providers that have full integration with Kubernetes for things like snapshotting and resizing, but that's still something new, and it's something that not a lot of administrators do yet. So most, if not all, storage management still happens outside of Kubernetes, and there's kind of a disjointed thought process behind restoring your application.

B

There are other things to consider, too, when you're restoring from a catastrophe. You need to deal with things like file permissions and make sure those are set. You also need to think about whether you're going to restore from backup.

B

If you're running a database, for example, a lot of the restore and data management is built into the database system. With Postgres, for example, a lot of administrators can elect to just attach the storage and have the database try to repair itself, and then you may or may not need to restore certain tables from backup.

B

So that's another decision you need to make. Those are some other things to consider: where does the storage live, how is its management done, and how are you going to do the restore? With that, I'd like to bring up some of the tips and tricks that I've used. First: don't use dynamic storage. Yeah, sorry. I know it's no fun, but in a GitOps world I think using dynamic storage is actually an anti-pattern, in my opinion.

B

You actually want to specify the storage you are using for a specific application. In a GitOps world you're using Git as your source of truth, and with dynamic storage, what you get can change on the back end. So you really want a one-to-one relationship from your storage to your application.

B

Use label selectors with persistent volumes and persistent volume claims. When you create a persistent volume and a persistent volume claim, you've got to make sure they match up by using things like labels.

B

And always remember: please set the reclaim policy to Retain so that you don't lose any data in case of a catastrophe. I also like to set up security context constraints. We're really big on SCCs here at Red Hat, and I know a lot of that has been pushed up into Kubernetes proper.

B

You can set up the security context constraints for the UID and the group ID for the application, and this will help with things like permissions. With that, I'm going to go over a quick demo, really quickly, because I know this is a short talk. So here I have an application.

B

This application is a three-tiered application: a front-end web tier, a back end, and a DB. It's a simple CRUD application, so I do have some data stored in the application here. If I look at the configuration, I want to call out the persistent volume here really quickly. A few things I want to call out: one, I'm labeling this. Remember, label selectors are very important.

B

I am setting the persistent volume reclaim policy to Retain, and I'm setting the claim reference to say, specifically, satisfy this specific claim, mapping those one to one. I have a volume that I created on AWS specifically for this. As you see here, I have a little cheat sheet that I use; this creates the volume, and I use this volume ID specifically.

B

I want to say: use this specific volume. And since EBS volumes are zone specific, I want to say: always make it available in this zone. The claim is very simple; it's kind of what you'd expect.
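The manifest on screen is not captured in the transcript, but a persistent volume with the properties described (a label, a Retain policy, a claim reference, a pre-created EBS volume ID, and a zone constraint) could be sketched roughly as follows. All names, the volume ID, the zone, and the size are placeholders, and the CSI volume source is an assumption; the demo may have used a different volume source.

```yaml
# Sketch of a statically defined PV pinned to one pre-created EBS volume.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pricelist-db-pv                    # assumed name
  labels:
    app: pricelist-db                      # label the claim's selector can match
spec:
  capacity:
    storage: 5Gi                           # assumed size
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain    # keep the data if the claim goes away
  storageClassName: ""                     # no storage class: static provisioning
  claimRef:                                # reserve this PV for one specific claim
    namespace: pricelist                   # assumed namespace
    name: pricelist-db-pvc                 # assumed claim name
  csi:
    driver: ebs.csi.aws.com                # assumes the AWS EBS CSI driver
    volumeHandle: vol-0123456789abcdef0    # placeholder volume ID
    fsType: ext4
  nodeAffinity:                            # EBS volumes are zone specific
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values:
                - us-east-1a               # placeholder zone
```

Because the volume ID lives in Git alongside the rest of the application, reapplying the manifest on a new cluster points back at the same data.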

B

A few things I want to call out: one, I use the same label. I'm not using a storage class, so I leave that blank, and then I'm using a selector. This is how I keep them one-to-one. Looking at the database, because that's where my state is being saved, I am also using a node affinity: basically, keep this database in the same zone that I'm deploying the storage in, so that it grabs the right storage when it uses that persistent volume claim.
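A claim matching that description (same label, blank storage class, and a selector) might look like the sketch below. The names, namespace, label, and size are assumptions for illustration; they only need to agree with whatever the corresponding persistent volume uses.

```yaml
# Sketch of a claim that binds to a pre-created, labeled PV.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pricelist-db-pvc        # assumed name
  namespace: pricelist          # assumed namespace
  labels:
    app: pricelist-db
spec:
  storageClassName: ""          # empty string opts out of dynamic provisioning
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi              # assumed size; must not exceed the PV capacity
  selector:
    matchLabels:
      app: pricelist-db         # only PVs carrying this label are candidates
```

Setting `storageClassName` to an empty string, rather than omitting it, keeps a cluster default storage class from provisioning a new volume for the claim.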

B

I also want to call out that I'm using a security context constraint: I'm adding a supplemental group so that the permissions always stay the same. This application always runs as this specific group ID, so the permissions don't get messed up. If I go down here and get the persistent volumes, you'll see that I have that persistent volume and it's bound to that claim.
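The database workload described here, with a zone affinity, a supplemental group, and the claim mounted as a volume, could be sketched along these lines. The image, group ID, mount path, zone, and all names are hypothetical, and on OpenShift the group ID would also have to be permitted by the security context constraint in effect.

```yaml
# Sketch of a database Deployment pinned to the volume's zone.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pricelist-db                       # assumed name
  namespace: pricelist                     # assumed namespace
spec:
  replicas: 1
  strategy:
    type: Recreate                         # avoid two pods contending for one RWO volume
  selector:
    matchLabels:
      app: pricelist-db
  template:
    metadata:
      labels:
        app: pricelist-db
    spec:
      securityContext:
        supplementalGroups: [5555]         # assumed group ID that owns the data files
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: topology.kubernetes.io/zone
                    operator: In
                    values: ["us-east-1a"] # placeholder: same zone as the volume
      containers:
        - name: db
          image: registry.example.com/pricelist-db:latest   # placeholder image
          volumeMounts:
            - name: data
              mountPath: /var/lib/data     # assumed data directory
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: pricelist-db-pvc    # assumed claim name
```

Keeping the group ID fixed in the manifest means a freshly scheduled pod on a rebuilt cluster can still read files written by its predecessor.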

B

And then that is bound there, so I have this application. So let's simulate an error. First I get the namespace.

B

You see the namespace is here. So now I delete it.

B

The pricelist namespace is now in the process of being deleted.

B

So, as you see here, Argo CD automatically says, "Hey, this is out of sync; things are being deleted." The application is actually gone: if I go over here and reload, the application is gone. The PV is still there, though. If I get the persistent volumes, it's still there, and the status is now Released. But in an actual failure the PV would be gone, so let's delete the PV as well.

B

The persistent volume is now gone. So this is like: either the project got deleted, the persistent volume got deleted, or the cluster got deleted, and we're simulating me reinstalling a cluster. This is completely gone; the application's gone. So all I have to do is just sync this back up.

B

The idea is that since I specified the storage specifically in Git, as you see in the YAML, and that volume was already created on AWS and was already there, reapplying the manifests just works. And here, as you see, everything is starting to sync up. The very last thing to sync is the OpenShift route, and it looks like I'm back in business: everything's green.
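For context on what is being synced: Argo CD tracks the manifests through an Application resource. The one used in the demo is not shown in the transcript, so the following is only a sketch with a placeholder repository URL, path, and names. It omits an automated sync policy, which is consistent with the speaker triggering the sync by hand.

```yaml
# Sketch of an Argo CD Application that is synced manually.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: pricelist                 # assumed name
  namespace: argocd               # assumed; OpenShift GitOps installs use a different namespace
spec:
  project: default
  source:
    repoURL: https://example.com/org/pricelist-config.git   # placeholder repository
    targetRevision: main
    path: deploy                  # assumed directory holding the PV, PVC, and workloads
  destination:
    server: https://kubernetes.default.svc
    namespace: pricelist          # assumed namespace
  syncPolicy:
    syncOptions:
      - CreateNamespace=true      # recreate the deleted namespace on sync
```

With the persistent volume, claim, and workloads all under the same path, a single sync recreates the namespace and rebinds the application to the pre-existing EBS volume.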

B

I have my data back, right? I can test this application.

B

Right: an Argo CD t-shirt, and let's give it, you know, three thousand dollars, as a give-back to the community, right? This is fashion. And it looks like my application is up and running and it's working.

B

So I hope you enjoyed this presentation, and I invite you to connect with me. I am available on the CNCF Slack in the Argo CD channels to talk more about stateful applications. Thank you.

B