Cloud Native Computing Foundation GitOpsCon EU 2022, 19 May 2022

From YouTube: Infrastructure as Software with GitOps - Justin Garrison, Amazon

Description

The cloud has enabled abstractions and automation, but Infrastructure as Code (IaC) doesn't scale. You can use declarative YAML or imperative scripts and still lose control. Infrastructure as Software (IaS) allows you to control and scale infrastructure with the same practices as applications. GitOps is an implementation of IaS with lots of benefits over IaC. We'll look at how it's different, when you should use it, and where it potentially breaks down.

A

I was sitting in front of floor-to-ceiling windows on the top story of the building where I worked. It wasn't a very tall building, but the view was enough. It was a view of a freeway that I was going to sit in traffic on in a few hours, and I felt like I was succeeding at a lot of things. I had deployed Kubernetes for my organization.

A

It was automated.

A

I had set up all of these things that let developers use templates and deploy applications quickly, and I was a good sysadmin, a good systems engineer, by automating those things. If you're familiar with CoreOS, which is what I was using at the time, it had the ability to do automatic updates, and most people turned those off because they were worried that things would break.

A

I left those on. I had this bare-metal cluster set up, and my automatic updates would deploy every weekend. I set up a time when they could do those updates, and I would come in on a Monday and the cluster was upgraded, and I was just like, wow.

A

I have made it to this peak of Kubernetes, this peak of sysadmin, where I had automated some of my job and I didn't have to do that anymore. I was sitting in front of these windows because I was writing a book about cloud native infrastructure with my co-author Kris, and I was waiting for her call. We had one last session to brainstorm, because the book was focused on things we learned in Kubernetes and in the cloud. We got on the call, and we were going over a couple of sections of the chapters that we just didn't...

A

We didn't know how to put what we had learned into words, and she said something that I'll never forget. She said: if all of our infrastructure is APIs, then our infrastructure management should be applications. It shouldn't be a repository of automation, a repository of code.

A

It has to be software. It has to be something that is running that manages those APIs. Everything that I thought I did right with Kubernetes was just automation that I had built. I had built the same thing I had done over and over again, and I had just automated pieces of it. But she was right that we had to write software.

A

We had to create software that controlled APIs. Software is all about taking data, changing that data, and calling APIs. Pretty much every application and every piece of code is doing something along those lines: it takes something in, changes it for whatever your business wants, and then calls APIs. And if all of our infrastructure is APIs, then the software that should be running to manage our infrastructure had to do the same thing.

A

We couldn't rely on infrastructure as code anymore. Infrastructure as code was just the automation piece, and it didn't scale as well, because you had to trigger it at certain times and you had to make sure it worked. From that point I was like, okay, what is the thing that exists that does infrastructure as software? What is running software that takes in data, calls APIs, and applies that? The very first thing that was obvious to us was the Kubernetes controller.

A

The Kubernetes controller just calls APIs over and over again. It looks at the state that it wants and the state that it has, and it makes it happen. That was the very core piece of infrastructure as software: this idea that we should actually be managing our infrastructure with these control loops. Once we started looking around, this pattern of infrastructure as software showed up in other places. If you're familiar with it, Netflix has Chaos Monkey. Chaos Monkey is like the opposite of infrastructure as software, but it does the same thing.
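As an illustration that was not shown in the talk, the control loop described here can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the "infrastructure" is just an in-memory dictionary of resource counts, and `reconcile`, `apply`, and `control_loop` are hypothetical names, not part of any Kubernetes API.

```python
from typing import Dict


def reconcile(desired: Dict[str, int], actual: Dict[str, int]) -> Dict[str, int]:
    """Compare wanted state with current state; return the changes needed."""
    actions = {}
    for name, want in desired.items():
        have = actual.get(name, 0)
        if have != want:
            actions[name] = want - have
    # Anything running that nobody asked for gets removed.
    for name, have in actual.items():
        if name not in desired:
            actions[name] = -have
    return actions


def apply(actual: Dict[str, int], actions: Dict[str, int]) -> None:
    """Stand-in for real API calls: mutate the in-memory infrastructure."""
    for name, delta in actions.items():
        actual[name] = actual.get(name, 0) + delta
        if actual[name] == 0:
            del actual[name]


def control_loop(desired: Dict[str, int], actual: Dict[str, int],
                 max_iterations: int = 3) -> Dict[str, int]:
    """Keep reconciling until nothing is left to change (or we give up)."""
    for _ in range(max_iterations):
        actions = reconcile(desired, actual)
        if not actions:
            break
        apply(actual, actions)
    return actual
```

A real controller would run this loop indefinitely and call a cloud or cluster API inside `apply`; the bounded loop here only keeps the sketch easy to run.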

A

It's software that runs, takes data in, and calls APIs, but it breaks things on purpose. Chaos Monkey would degrade the state of your infrastructure on purpose with software, and it wasn't a one-time git push. It wasn't a repository of YAML files. Chaos Monkey had to constantly look at the state of the world and ask: is this right?

A

Is that right? But neither of those things was GitOps. This was before GitOps was a term, but it fit really well once GitOps was announced and solidified into these four principles that we've been hearing all day.

A

Once I heard about GitOps, I was like, it's infrastructure as software. It's exactly the thing that we learned in Kubernetes controllers, that we saw over and over again at large scale, at very high impact, in high-velocity environments. You had to manage it as software, and GitOps is a core definition of that. It's an implementation of infrastructure as software. It does this control loop.

A

It takes in that data and it calls APIs. And I like to visualize things. I don't know if anyone's ever seen me talk somewhere, but I like to use props. I like to use things that are visual, because I'm a very visual person. So, if you've never seen these, these are AWS Snowballs.

A

They have 52 cores in them and 256 gigs of RAM. They're Snowball Edge devices, and I'm using them today to represent standing up our infrastructure. If you're using AWS, one of these is essentially like a 16xlarge, right? That's how big one of these boxes is, and we'll ship them to you and you can run edge compute with them.

A

You can put storage on them. I think they have 42 terabytes of storage. They're pretty nice to have at the edge. But this is what I was thinking of when I had on-prem servers: I would stand these up. And what do you call two Snowballs on top of each other? It's a snowman. But when I was running infrastructure, I would deploy it and I would run these scripts.

A

They would stand up my infrastructure and get it all set, so that every time I wanted to run it, I would trigger something, Jenkins or whatever, and it would actually deploy this. It was great because I had automated that piece of it, but I couldn't touch it. I couldn't do other things to it. I realized time and time again that infrastructure as code wasn't enough, because someone would come along and do something to the infrastructure, something would break, and nothing brought it back.

A

Nothing stood those back up. Nothing brought it back to the desired state until we ran that infrastructure as code again. That repo of code that stood these up is what would actually bring it back to a state where we were like, okay, now we're good. The main thing with those controllers, with infrastructure as software, is that you always have something watching this state, so you don't have to worry about, oh, it's gonna go... no!

A

No. Software is constantly looking at that desired state and catching it before things fall over, or if it does fall over, it'll bring it back to that state. We don't have to trigger our CI/CD. We don't have to do this stuff. I just like knocking these over now.

A

No one tell my coworkers. They're rugged, it's fine. But I wanted to come and talk about why GitOps is really powerful, and the core principles of GitOps illustrate this really well. And yes, you're going to have more than one controller managing your infrastructure. You're going to have Kubernetes controllers, and you'll probably have cloud controllers that do different things. There are all these controllers that are implementations of software. And the first thing that I wanted to touch on again...

A

Is this: what is the code piece? Infrastructure as code is everywhere. And okay, that's fine. We all know that running this isn't infrastructure as code, right? That's not it, because you can't just run it at your command line. You don't want people manually doing this. So what you really want...

A

Is something like this. Oops.

A

Let's echo it so it doesn't actually do something.

A

Right, is that infrastructure as code now? That's what we just did, right? I mean, that was essentially...

A

I have infrastructure as code. I've made it, right? But that's just the automation of the thing. It's still not any better. GitOps, again, is very specific in how we're implementing this. And if you didn't notice, at this point I have code.

A

When did it become software? Where was the piece in there where it was like, oh, now it's software, or oh, it's done, it's no longer software? Software is code while it runs. It's only in that running state. It's only the code with electricity applied to it, when it goes through the processor.

A

Then it's software. If you have a repo that is code, that is scripts, that's not software. That is just static, and that's fine. Those are your automation pieces.

A

But how do we break this up to say, okay, we need this to be appliable, we need it to be always running, we need it to be software?

A

And again, as software, we want to separate the data we bring in, the variables we're taking in, from the APIs we're changing things with. So for a lot of people, it's going to be...

A

Right, now it's kind of software. Terraform is a controller. Terraform looks at a desired state, it looks at the current state, and it reconciles them. It's just a command-line version of it. It runs locally. It's exactly what a controller inside of Kubernetes is doing. It's exactly what Flux and GitOps are doing. It's doing the same sort of reconciliation, and that's fine. But again, it's not what GitOps is proposing, because in this case you could...

A

Right? Is that infrastructure that's continuously reconciled, at some level? Terraform is going to always just do this, but that is really bad. That's going to hit limits. That's going to hit all sorts of things. That doesn't scale. We need...

A

GitOps is a very specific way of doing something like this so that we can help it scale. I really like looking at the principles, because that informs us of what other things are doing similar things. They all kind of go along the same route. They all have similar principles in how they're going to scale and what they're going to do. So we looked at software as being only when code has electricity applied to it, and here are these principles that, again, you've been hearing all day.

A

I'm not going to repeat every single one of them, but this declarative piece is all about separating out the step-by-step from the data that feeds a controller. We're declaring a state.

A

For a long time, I falsely believed that declarative meant nothing was imperative, and that is absolutely not true. When I run terraform apply, I declare my state in my tfvars, my Terraform manifests, but Terraform does everything imperatively. It builds this DAG that then says, okay, I need to do this first, then I do this. That is exactly imperative.

A

I just didn't have to define it.

A

So for the longest time I thought, oh, controllers are declarative. But wait a minute, that doesn't make sense. Something has to say this goes first, then that, then this. The important thing is to have computers do that for us, and not in a script, because humans have a lot more assumptions and a much narrower view of the state of the world. Controllers and Flux and Terraform can get a more holistic view of what's going to be applied, and that's when they can actually see what needs to happen in order. Again, it's not that declarative means nothing is imperative.

A

It means that you're separating out your end state. Your interface to the thing says, I want this to happen, but something back there does it step by step by step.
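As an illustration that was not shown in the talk, here is a minimal Python sketch of how a declared end state can be turned into ordered, imperative steps by walking a dependency graph (a DAG). The resource names and the `plan_steps` helper are hypothetical; this is not how Terraform itself is implemented, only the general idea, using the standard library's `graphlib` (Python 3.9 or later).

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Declared end state: each resource lists the resources it depends on.
# Nothing here says what to do first.
desired_state: Dict[str, Set[str]] = {
    "vpc": set(),
    "subnet": {"vpc"},
    "security_group": {"vpc"},
    "instance": {"subnet", "security_group"},
}


def plan_steps(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Derive an imperative, step-by-step order from the declared graph."""
    return list(TopologicalSorter(dependencies).static_order())


steps = plan_steps(desired_state)
for number, resource in enumerate(steps, start=1):
    print(f"step {number}: create {resource}")
```

The person writing `desired_state` never states an order; the ordering is computed from the whole graph, which is the "more holistic view" a controller has and a hand-written script does not.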

A

So what we do is...

A

All right, so there's my declarative state. Did I put that in here? Yeah. So I have my tfvars in there. That's my declarative state. But what is immutable? I think that's... okay.

A

That's it, right? I'm immutable now. I can't write to it anymore. That's all we had to do, and now we're immutable. Congratulations, you're at step two of GitOps. No, well, kind of. This is the fundamental piece of what we want you to do. We want you to be immutable, we want you to be versioned, and the main thing we want to do is actually...

A

Oops, see that? I got permission denied. I want to version this in order. I want to go to the next version.
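The exact commands typed in the demo are not captured in this transcript. As an illustration only, here is a minimal Python sketch of the same idea: each version of the desired state is written once, made read-only, and never overwritten. The `vars.vN.tfvars` naming scheme and the `publish_version` helper are assumptions made for this sketch.

```python
import os
import stat


def publish_version(directory: str, version: int, content: str) -> str:
    """Write a new, read-only state file; never overwrite an existing version."""
    path = os.path.join(directory, f"vars.v{version}.tfvars")
    # Mode "x" raises FileExistsError if this version was already published.
    with open(path, "x") as handle:
        handle.write(content)
    # Drop every write bit so a change requires publishing a new version.
    os.chmod(path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return path
```

Note that read-only permission bits are only a convention: a privileged user can still write to the file, which is one reason real GitOps tooling relies on a versioned store such as git instead.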

A

And so in here, what do we need to do? We gotta... let's see.

A

Let's first sort that. Let's find all of them. We only want to apply the latest, right? I think that'll do it, something like that. Let's see, version two. Right, there we go. We're now only pulling in the latest version. So every time... Thank you, next door.

A

Every time we run our infrastructure as code, this is still just code that becomes software. It pulls in this latest version, immutable state. It's declarative, it's immutable. We're almost there. We're building our own GitOps with this infrastructure as software mindset.
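The sort-and-pick-latest step from the demo is not captured in this transcript. As an illustration only, here is a minimal Python sketch of choosing the newest version from a list of file names. The `vars.vN.tfvars` naming pattern and the `latest_version` helper are assumptions made for this sketch.

```python
import re
from typing import Iterable, Optional

# Hypothetical naming scheme: vars.v1.tfvars, vars.v2.tfvars, ...
VERSION_PATTERN = re.compile(r"^vars\.v(\d+)\.tfvars$")


def latest_version(filenames: Iterable[str]) -> Optional[str]:
    """Return the file name with the highest numeric version, or None."""
    best_name, best_number = None, -1
    for name in filenames:
        match = VERSION_PATTERN.match(name)
        if match and int(match.group(1)) > best_number:
            best_name, best_number = name, int(match.group(1))
    return best_name
```

Comparing the version as a number matters: a plain alphabetical sort would rank `vars.v2.tfvars` after `vars.v10.tfvars`, which is the kind of small bug that hand-rolled controllers tend to collect.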

A

Pulling in GitOps does two very important things, and I really like that they put this as a core piece of GitOps. I know there are different pieces here, but there are two big reasons you want to pull your desired state, the data that you have. One is for scalability, because if you have one or two clusters or one or two servers, you're probably fine.

A

If you have a thousand, it's going to be a little bit harder to push that out everywhere, whereas with a pulled state it's a lot easier. We, technology in general, have known how to serve files for a very long time. We can serve websites, we can serve static stuff, we can cache them. But the compute side of it is intensive, and all of the API calls that result in that action happening take a long time.

A

So we pull that data first. We want to make sure that's stored somewhere. Yeah, you store it in git, or you store it on a web server, wherever.

A

That's one important thing, and the security aspect is the other one. In case you haven't been paying attention to breaches over the past five years, a big way in the door is CI/CD systems that have global admin access everywhere. That is dangerous when someone can get to just one system and then move horizontally anywhere, and we want to prevent that.

A

The downside, of course, is you have to run more controllers. You have to run more software in places that only have limited scope for what they can apply to. But the benefit is that if something breaks in that one limited controller, it doesn't have a large blast radius. It has a very limited one, for security, for downtime, for all these other things. So this pull aspect is really good for scalability.

A

It's just a lesson learned, right? Weaveworks had been doing this for a little while, and they're like, hey, you should probably pull that stuff, and here's why. It's a great principle to have. And so in our infrastructure...

A

We want to do something like this, right? We're going to pull that data down. Let's say it stores our variables file. Once we pull it down, we're almost set to go, because then I can run that terraform apply. And I can assume that, bash being bash, if that fails my code exits, and I don't actually have a problem of half a variables file applying somewhere.
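As an illustration that was not shown in the talk, here is a minimal Python sketch of "pull first, apply only if the pull succeeded". The `fetch` and `apply` callables are hypothetical stand-ins for a download and for something like `terraform apply`; no real network or CLI call is made. The file is written to a temporary path and swapped into place in one step, so a failed download never leaves half a variables file behind.

```python
import os
import tempfile
from typing import Callable


def pull_desired_state(fetch: Callable[[], bytes], destination: str) -> None:
    """Fetch the desired state and swap it into place only on success."""
    data = fetch()  # may raise; nothing on disk has changed yet
    directory = os.path.dirname(os.path.abspath(destination))
    fd, temp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        # Rename within one directory, so readers see old or new, never half.
        os.replace(temp_path, destination)
    except BaseException:
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise


def pull_then_apply(fetch: Callable[[], bytes], destination: str,
                    apply: Callable[[str], None]) -> bool:
    """Run apply only after a complete pull; return False if the pull failed."""
    try:
        pull_desired_state(fetch, destination)
    except OSError:
        return False
    apply(destination)
    return True
```

A shell script would need something like `set -e` to get the same stop-on-failure behavior, since a plain script keeps going after a failed command.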

A

On this one, continuous reconciliation: again, it was kind of a trip for me, because I didn't know how continuous that meant. Does that just mean while true? It doesn't. It's not about being continuous. At one end, we can say you run every loop: every time you finish, you try again. On the other side is our traditional infrastructure as code, which ran only when a file changed. I don't know if anyone did config management for a long time.
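As an illustration that was not shown in the talk, here is a minimal Python sketch of a middle ground between the two ends of that spectrum: reconcile when the desired state changes, and also on a slower periodic resync, instead of in a tight `while True` loop. The function names and the interval value are assumptions made for this sketch.

```python
import hashlib


def fingerprint(desired_state: bytes) -> str:
    """Short, stable identifier for a version of the desired state."""
    return hashlib.sha256(desired_state).hexdigest()


def should_reconcile(now: float, last_run: float, interval_seconds: float,
                     current_fingerprint: str, applied_fingerprint: str) -> bool:
    """Reconcile if the desired state changed or the resync interval elapsed."""
    changed = current_fingerprint != applied_fingerprint
    due = (now - last_run) >= interval_seconds
    return changed or due
```

The interval is the knob: a short one approaches the tight loop, a very long one approaches "only when a file changes". Slow-moving layers can reasonably use a longer interval than fast-moving ones.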

A

Once I changed my Puppet manifest, I applied that manifest to servers, and that was fine as long as I kept a normal cadence of changes. But in something like infrastructure, you have a low level of, say, DNS and network infrastructure. These things don't change very often, and if you're changing those with a while loop, you will likely break something, and that is scary. And so you wanted something beyond the...

A

Only when a file changes. Because again, if I wait for a file change and something happens to my network, I'm going to be parsing Terraform state files, I'm gonna be...

A

Fixing Terraform state files, and if anyone has done that, that is a bad night on call. I am sorry if you've ever had to go through that. But that was a downside of infrastructure as code. I had my Terraform manifests, and when I was on call and I got this call, I was like, how could it be down? I have infrastructure as code.

A

My infrastructure as code solved this problem. And then my state had drifted to a point where the steps Terraform was going to do, the imperative steps, said: I can't get from here to there, because someone in the console did something that broke everything, and now you have to go fix it.

A

I had to go in and manually adjust Terraform state, or I'd manually go to the console and figure out the knobs to turn to get it back to some place where Terraform could apply. That was the difference between infrastructure as code and this sort of GitOps-style continuous reconciliation. And that's the last piece that we want here: I want to take this script and run it on more than just file changes.

A

I want to do it when infrastructure changes too, but we're going to simplify it, because this is a crappy demo of what we're doing here. So, let's say, if you've never used iwatch... oh, and I gotta find it. There it is, because I couldn't remember this one off the top of my head.

A

iwatch will watch for file system changes and run a command. So we're telling iwatch: as soon as a file is written in this folder (close_write is the event it's going to look for in this folder), run my infrastructure-as-code script. And so over here I can actually...
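The iwatch command line used in the demo is not captured in this transcript. As an illustration only, here is a minimal Python sketch of the same idea using standard-library polling instead of inotify events: take a snapshot of a folder, compare it with the previous one, and run a callback for every new or modified file. The helper names are hypothetical.

```python
import os
from typing import Callable, Dict, Set, Tuple

Snapshot = Dict[str, Tuple[int, int]]


def snapshot(directory: str) -> Snapshot:
    """Record (mtime_ns, size) for every regular file directly in directory."""
    result: Snapshot = {}
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file():
                info = entry.stat()
                result[entry.name] = (info.st_mtime_ns, info.st_size)
    return result


def changed_files(before: Snapshot, after: Snapshot) -> Set[str]:
    """Names that are new, or whose mtime or size differs between snapshots."""
    return {name for name, meta in after.items() if before.get(name) != meta}


def watch_once(directory: str, previous: Snapshot,
               on_change: Callable[[str], None]) -> Snapshot:
    """Run on_change for each new or modified file; return the new snapshot."""
    current = snapshot(directory)
    for name in sorted(changed_files(previous, current)):
        on_change(os.path.join(directory, name))
    return current
```

Calling `watch_once` in a loop with a short sleep gives a rough equivalent of a file watcher. Event-based tools react sooner and avoid rescanning, which is why the demo reaches for one.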

A

Let's do version three. All right, and I saw right here it immediately applied that thing. This is still, at this point... Oh, I got version two. Oh, I've got a bug. See, that's a problem.

A

Oh, did you... I spelled it wrong. Look at that, good catch. See, that's what code review's for.

A

There we go, now we get version three. There it is. Thank you. See, we caught that in code review. In this case it's obviously still looking for file changes on disk, but GitOps is doing that reconciliation. One of the cool things about Flux and Argo is that they're doing that two-way sync, and it's one step beyond what a typical controller like this is going to do, where I'm only looking at my local files, and traditionally we're only looking for a git push. What you actually want to do...

A

Is look at the full state of the infrastructure and say, hey, I'm just going to check it. Terraform's going to go out there and run terraform plan every once in a while and say, hey, I think something's different now. And you can hook into these sorts of signals. At Amazon, we have EventBridge. There are all these different ways that you can look at what is going on in the infrastructure, and you can trigger things based on any changes.

A

I could look at my CloudWatch logs or CloudTrail and say, hey, whenever something happens within this scope, I want to run that controller. I want to make sure that my infrastructure is back to where it was.
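As an illustration that was not shown in the talk, here is a minimal Python sketch of triggering a controller only for events inside its scope. The event shape (a dictionary with a `resource` key) and the prefix-based scope are assumptions made for this sketch; they are not the CloudTrail or EventBridge event format.

```python
from typing import Callable, Dict, Iterable


def in_scope(event: Dict[str, str], scope_prefix: str) -> bool:
    """True if the event touches a resource this controller is responsible for."""
    return event.get("resource", "").startswith(scope_prefix)


def handle_events(events: Iterable[Dict[str, str]], scope_prefix: str,
                  reconcile: Callable[[], None]) -> int:
    """Reconcile once if any event is in scope; return the number of matches."""
    matching = [event for event in events if in_scope(event, scope_prefix)]
    if matching:
        reconcile()
    return len(matching)
```

Reconciling once per batch, instead of once per event, keeps a burst of changes from turning into a burst of API calls.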

A

And that was, for me, how I started growing from what infrastructure as code looked like, and why code was...

A

Not enough, to where we had controllers. Kubernetes controllers and how those get applied inside the cluster were a great thing, because again, a controller knows the entire state of the API server and the data that's stored in it, and it gives you a ton of different annotations and logs and events inside of Kubernetes. Applying that in a more generic way is really what GitOps is all about, and you can apply GitOps principles...

A

You can apply infrastructure as software principles to anything, and it's not just the Kubernetes pieces, because again, you're going to have different scopes for these controllers. If you think you're going to do one controller and it's going to do everything, that's like writing one Terraform main file that does everything. You don't want that scope. You don't want that blast radius for one file or one controller. So you do want to separate these things out. You want to have that limited scope.
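As an illustration that was not shown in the talk, here is a minimal Python sketch of limited scope: each controller may only act on resources under its own prefix, so its blast radius is the set of resources it can touch. The class, the prefix convention, and the resource names are all hypothetical.

```python
from typing import Iterable, List


class ScopedController:
    """A controller allowed to act only on resources under its own prefix."""

    def __init__(self, name: str, scope_prefix: str) -> None:
        self.name = name
        self.scope_prefix = scope_prefix
        self.applied: List[str] = []

    def can_touch(self, resource: str) -> bool:
        return resource.startswith(self.scope_prefix)

    def apply(self, resource: str) -> None:
        """Record a change, refusing anything outside this controller's scope."""
        if not self.can_touch(resource):
            raise PermissionError(f"{self.name} may not modify {resource}")
        self.applied.append(resource)


def blast_radius(controller: ScopedController,
                 resources: Iterable[str]) -> List[str]:
    """Everything that could be affected if this one controller misbehaves."""
    return [resource for resource in resources if controller.can_touch(resource)]
```

In a real environment the boundary would be enforced by credentials and permissions, not by a check inside the controller; the sketch only shows what is being limited.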

A

Having pull-based, limited scope inside of environments is a great idea, and then applying this whenever there's a change on the infrastructure or on files is really that last piece. And that's really all I have for why infrastructure as software and GitOps work so well together. GitOps is an implementation of infrastructure as software, and that is the main direction that all of this should be going in. I think there's been great progress with GitOps in general so far. So thank you.

C

Well, well done! Thank you so much for that presentation. We do have a few minutes for questions. Are you available to take some questions? So, for questions, you can raise your hand, or we have a mic in the middle. You can just jump to it with any questions.

A

I will say I am showing off running Kubernetes on these in the AWS booth tomorrow with EKS Anywhere, which also implements GitOps. So if you want to see them in action with Kubernetes, come by the booth tomorrow at 10:30.

C

Those aren't hollow? No?

A

They are real. They are...

C

They're heavy? You were really just punching those over that whole time. I thought, oh, he's just got the case up there.

A

No big deal. No, they are real.

C

All right. Well, good luck with the demo tomorrow. Yeah, so, any questions? Raise your hand, don't be shy.

A

And I'll be around all week, so... sorry.

C

While they're thinking of one... Can you guys hear me okay? Because I can't hear the mic feedback. You were mentioning that you felt like, on some level, it wasn't declarative because it relied on an imperative system for operation.

C

So, I mean, in that case there is no such thing as declarative, right? Because everything always relies on an imperative operation. Even if you built a language that was only declarative, at the end of the day it has to be put into bytecode, which is imperative on the CPU. So I was wondering if you'd maybe just speak on that for another moment or two.

A

Yeah, there's also no such thing as immutable, which was kind of mind-blowing for me to think about. It's like, no, actually everything changes over time once we run it. Those were two things where, as a systems person, I kept thinking, oh well, this is how it has to be, but then I realized that the main benefit was to the humans.

A

The main benefit of all of this is that you can sleep at night, so that you don't get paged, so that you can get your paycheck and have a good balance in life. If you need to scale, if you need to automate things, automation goes so far, but there always has to be this trigger, and GitOps really supplies that trigger. We're still doing a lot of the same things and we're just pushing that out, but the human interface to those imperative systems, and the human interface to those...

A

Things we thought were immutable, it was okay. My view of what's immutable is like, oh well, I want this application deployed, but then I have this other controller that comes in and says, oh, but I need to scale it up. And I'm like, well, I didn't tell it to scale up, so it's not immutable, because I told it five replicas, but something else figured out it needed ten. At that point I'm like, well, now it's not immutable. So is it bad? No, it's wonderful, because I didn't have to do it.

A

It was a huge benefit to have these controllers, and something to figure out that when I say I want five, it says, well, I need one first, and then I need two, and then I need three. It had to do those in order, and the scheduler had to bind them, and a kubelet had to pull them down, and I had secrets, and all this stuff had to happen in order.

A

But my view, the human interface to the rest of the system, is very declarative. I like to think of it as if I'm driving a car. I can call an Uber and say I need to go to the grocery store.

A

My interface to that system is: I made it to the grocery store. I was there. I could also drive myself.

A

I could jump in a car and steer the wheel left and right and step on the gas and the brakes, and that is very imperative. That is me telling it exactly, step by step, every little thing. But I don't have to say piston one fire, piston two fire, piston three fire. That's a horrible interface, right? There are still imperative things behind every declarative thing we do, and what stage of declarativeness I want and need all depends on how much ownership and how much control I need.

A

If I need to get there in five minutes and it's a 30-minute drive away, I can't request an Uber, because I'm not going to get picked up in time. So I need some control there. I have to get there fast. You know, I have to get to the hospital, my wife's going into labor, let's go. I'm not going to sit there and say, well, let's call a cab.

A

Sorry, let's hope you get there. So it's all about the interface with humans.

D

Just because I didn't see any other hands up... yeah. So would you say...

D

Well, I don't know if it's necessarily ownership, but would you say it's how much you can trust your software? Because you're describing infrastructure as software, right?

A

Yeah. There are always bugs in software. How many bugs did I write today in three lines of bash? There are always bugs in software, and a lot of that trust only comes over time, right? It's why I have to trust other people. Thank you for the PR, I liked the review. We trust the team.

A

We trust experience, we build that trust over time, and then we can trust that software over time. Some of that trust comes from the fact that other people generally use this thing, right? Flux is amazing not because I've used it forever.

A

It's because a thousand other people use it and they told me some practices for it. And I trust my car. My wheels aren't going to fall off on the way to the store, not because I put the bolts on, but because I trust the mechanic who has done it a thousand times to have done it correctly. There's always that implicit trust of, well, I have to trust someone in this case, and being able to trust other people and trust the ecosystem and trust the tooling. Again...

A

Please don't write your own bash controller for GitOps. Use one that's fairly trusted, because you're going to have those bugs. Gaining your own experience adds more trust, but a lot of that comes from knowing how the system breaks in your environment and knowing, oh, my process didn't align with how that thing was intended to be used. And so once you have some sort of intention-based thing where you're like...

A

Oh, I think it's going to work this way, but I have five teams that are deploying to their own repos that all merge into this other thing. That's where you kind of get into problems, because you're like, oh, actually, my idea of how I was using it didn't align with how the rest of the community was using it, or didn't align with the process inside our company. And so...

A

You have to build that trust for yourself as well, but yeah, gain that trust through this common tooling and these common practices. Again, GitOps has those four principles for a very good reason, and the tooling implements them for those reasons.

C

Open for audience questions. Yup, got one over here.

E

Hi, great talk. One question. You didn't show it, but it's always in my head when I use Argo CD. If we revert in the GitOps case, what is the best practice: to revert over the UI in Argo CD, because you have a history and a rollback option, or, since the single point of truth is git, to go to git and revert it there? And then you have auto sync.

E

You have a sync which, in case of issues, in case of bugs, will deploy it immediately.

E

It's also in my head why we have this history and rollback in Argo CD. Should the developers use it or not, should the users use it or not, or should we go through git, where the single point of truth lies?

A

I don't know the specifics of Argo's implementation of that, but in general you always want to roll forward. You always want time to progress forward. Computers sometimes have more gotchas when all of a sudden time reverses and they say, hey, I already saw that state. We were...

A

I worked at Disney+ for a while. We were managing infrastructure and we built our own controllers for how we were managing our clusters, and we had that same question: hey, what do we do? How do we revert? We deployed something that was bad. How do we get back to a known good state? And yeah, we could go to git. All of our state was stored in git. We could go to git and say...

A

Oh, I'm going to make the new head the old version. And we had so many weird things in software that never assumed that latest would be something that's not the most recent. It said, oh no, latest is the most recent as far as a timestamp goes, and a lot of systems still deal with timestamps. So what we decided for our system was that we never went back a head version. We always went...

A

We always did a git revert, which pushes one more head version ahead and says, oh, I'm going to take that thing and just undo this commit. But it's always a new commit, it has a new timestamp, and it always moves forward. And so in software and infrastructure, a lot of those systems are just easier to mentally...

A

Think about when the current state is always the state with the newest timestamp. So figuring out what the newest timestamp would be, and the newest commit as far as history of dates, just because we live in a time-synchronized world, is easier to reason about in a lot of ways for all those controllers. So I don't know if Argo has a specific way to do those reverts, but I know in other cases it's a lot of times safer...

A

Just to say, I need to revert, meaning I go forward in time with a new checkout.
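As an illustration that was not shown in the talk, here is a minimal Python sketch of rolling forward: the history is append-only, and a revert is a new commit, with a newer timestamp, whose state matches an earlier one. This is a simplified model with hypothetical names, not git's actual data model; in particular, it restores the previous state wholesale instead of undoing one commit's diff.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Commit:
    timestamp: int
    message: str
    state: Dict[str, int]


def commit(history: List[Commit], message: str, state: Dict[str, int]) -> None:
    """Append a new head; timestamps only ever increase."""
    next_timestamp = history[-1].timestamp + 1 if history else 1
    history.append(Commit(next_timestamp, message, dict(state)))


def revert_head(history: List[Commit]) -> None:
    """Undo the newest commit by appending a commit with the prior state."""
    if len(history) < 2:
        raise ValueError("nothing to revert to")
    bad, previous = history[-1], history[-2]
    commit(history, f'Revert "{bad.message}"', previous.state)
```

Anything that treats the newest timestamp as the current state keeps working after `revert_head`, because the head is still the most recent entry.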

C

The Argo maintainer in me says: let's talk after this. I think we have time for one more question while they're setting up for the next talk. One more question from somebody? Big pressure, you've got to get a good question. No? Okay! Well, let's give one more round of applause. Thank you so much.