Red Hat OpenShift OpenShift Commons Gathering: Los Angeles 2021, 3 Nov 2021

From YouTube: Site Reliability Engineering, Managed Services and the path to the future Sasha Rosenbaum (Red Hat)

Description

OpenShift Commons Gathering 2021
Site Reliability Engineering, Managed Services, and the path to the future
Guest Speaker: Sasha Rosenbaum (Red Hat) @divineops
https://commons.openshift.org/index.html#join

A

So while people are coming back in: this is going to be a talk on SRE, managed services, and the path to the future. My name is Sasha and I work for Red Hat.

A

So, by way of introduction: I've been in this industry for a long time, and I started off as a developer. I have a computer science degree, and then I went and had all sorts of different jobs in technology, most of which didn't exist when I was a kid, so you couldn't even choose them as a career path.

A

By and large, I like solving problems with people and technology. I like to believe that the world is getting better every day and that we are in a good industry to help it get better. That's why I'm excited to be at this conference today and to be talking to you all, and maybe we'll come up with new ideas.

A

Are we gonna fix this?

A

Okay, hopefully it doesn't happen again. So anyway, awkward. I'm going to be quoting this book a lot. This is the book on site reliability engineering, the first one by Google, co-authored by a whole bunch of awesome people. I really like the book.

A

It's a really good book to start with if you're just getting into SRE concepts and you want to understand what they're all about. I'm going to start with the sentence that I like least in the whole book, and that sentence is: "SRE is what happens when you ask a software engineer to design an operations team." I really, really don't like this.

A

This is usually my face when I see a definition like this, because I think it's very elitist. It assumes that developers are cooler than ops people, that ops people couldn't come up with the idea of automation, and that, you know, Google had to come in and solve all the world's problems. That really isn't what happened. The definition that I do like is that SRE is roughly Google's implementation of DevOps.

A

That definition is actually also in the SRE book, so I didn't make that one up. So we started off with DevOps more than a decade ago, and this happens to be a picture from one of the first two years of DevOpsDays Chicago. I'm the only person in this picture of organizers who identified as a developer.

A

Most of the other people identified as ops people, and all they really, really wanted to do was automate themselves out of a job. We've been running this conference for a while, discussing these kinds of ideas.

A

There is an awesome person, and she's right here waving at me, Bridget, and she helped grow this conference into a global enterprise that thousands of people in hundreds of cities show up to. And again, we were all talking about automation and, you know, how do we get to a better place where we get to solve more interesting problems?

A

It's not that easy to get to the future.

A

So if you were alive in the 90s and you remember what they looked like, you know that getting a new server up took three months if you were lucky, because you had to actually file a procurement order and wait for the actual physical server to show up. Then you had to build a server rack, wire it up, configure it, install stuff, and whatever.

A

Also in the 90s, this was very common, and unfortunately it still happens today: system downtime for two days because we are upgrading and deploying a new code version. How many nines is that? Does anybody know?

A

So if you had a couple of maintenance windows like that, that would be less than two nines, because two nines only gives you 3.65 days a year of downtime. And that's just planned maintenance, right? We took down servers for planned maintenance for whole weekends.
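
The arithmetic behind the "how many nines" question can be sketched in a few lines of Python. This is a minimal illustration, not something shown in the talk; the function names are made up for this example, and a 365-day year is assumed.

```python
def allowed_downtime_days(nines: int, period_days: float = 365.0) -> float:
    """Downtime budget, in days, for an availability of `nines` nines."""
    unavailability = 10.0 ** (-nines)  # two nines -> 0.01, i.e. 1 percent
    return period_days * unavailability


def availability_after_outages(outage_days: float, period_days: float = 365.0) -> float:
    """Fraction of the period the system was up, given the total outage time."""
    return 1.0 - outage_days / period_days


if __name__ == "__main__":
    # Budget for two nines over a year, in days.
    print(f"two nines budget: {allowed_downtime_days(2):.2f} days/year")
    # Two maintenance windows of two days each (an assumed reading of "a couple").
    print(f"after two 2-day windows: {availability_after_outages(4):.4%} available")
```

Two two-day windows add up to four days of downtime, which already exceeds the 3.65-day budget for two nines.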

A

I'm loving this. And we used to think that speed and reliability can't be friends, and that dev and ops can't be friends, because devs are incentivized to push to production as quickly as possible, and that breaks things, while ops are incentivized to keep the lights on. They carry pagers, they get paged in the middle of the night, and they hate change, right?

A

It's all about incentives. But the allegory that works better for software development is riding a bike. Our inherent assumption is that if we go slower, we break things less, but that's actually not always true, not in all domains. Software development is kind of like riding a bicycle: if you go too slow, it actually breaks more. You can't keep your balance.

A

So if that is the case, if going faster is actually better for everybody, then why was automating things such a problem?

A

Well, the biggest thing was that effective automation requires consistent APIs, and that's something we didn't have. One of the words that pops up a lot in this talk is APIs, and you need them to be able to automate anything. Clayton just talked about the bigger control plane and being able to automate something at a cross-cloud level, but we started off at the basic level. You had to start with an operating-system-level API.

A

With Linux we were lucky, because it's a file-based system, so you can write a script and automate things. But Windows was an executable-based system, so you depended on people having an actual API for the stuff that you wanted to automate, and guess what: people didn't have an API for the stuff you wanted to automate. And Windows was actually 41 percent of the server market in the 2000s.
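
To make the "file-based system" point concrete, here is a minimal Python sketch of the kind of script that becomes possible when configuration lives in plain text. The file name `app.conf`, the `key=value` format, and the `set_option` helper are all assumptions made up for this illustration, not something from the talk.

```python
from pathlib import Path
import tempfile


def set_option(config_path: Path, key: str, value: str) -> None:
    """Set `key=value` in a plain-text config file, appending the key if missing."""
    lines = config_path.read_text().splitlines() if config_path.exists() else []
    updated, found = [], False
    for line in lines:
        # The text before the first "=" is treated as the option name.
        name = line.split("=", 1)[0].strip()
        if name == key:
            updated.append(f"{key}={value}")
            found = True
        else:
            updated.append(line)
    if not found:
        updated.append(f"{key}={value}")
    config_path.write_text("\n".join(updated) + "\n")


if __name__ == "__main__":
    # Hypothetical config file in a temporary directory, for illustration only.
    with tempfile.TemporaryDirectory() as tmp:
        conf = Path(tmp) / "app.conf"
        conf.write_text("# example settings\nmax_connections=100\n")
        set_option(conf, "max_connections", "500")
        set_option(conf, "log_level", "info")
        print(conf.read_text())
```

No vendor-provided API is needed here: because the setting is just text in a file, the file itself is the interface.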

A

So it was a real problem that people were trying to solve, which brings me to one of my favorite transformation stories: the story of PowerShell, championed by Jeffrey Snover. That's the CLI, scripting language, and configuration management framework that shipped with Windows in 2006. Before that, Jeffrey went through five years in his career where he was on the verge of getting fired every day, because angry executives were yelling at him, "What part of effing Windows do you not understand? Admins don't want APIs!"

A

Turns out that admins do want CLIs and APIs and do want to automate things, and fortunately, automation won this battle. Every wave of automation enables the next wave of automation. Next, we got to infrastructure-level APIs. This is another quote from the SRE book: "Central to Borg's success, and its conception, was the notion of turning cluster management into an entity for which API calls could be issued."

A

So basically, we arrived at the idea that we needed an API for the entire infrastructure, and it had to be consistent, and it had to be manageable by automation. And it wasn't just Google, obviously. It was Amazon, it was Azure. Everybody was kind of arriving at the same idea, because there was this pressure to deliver adaptable services at scale, and you need an API to do that.

A

There was another thing happening in a slightly different part of the industry. If you weren't Google or Microsoft, and you didn't run a gazillion servers and couldn't custom-order the server racks the way you wanted them, you still needed automation. So companies like Puppet and Chef and Ansible were starting to build that automation for your own data center. What wasn't in there, what's new, is the service level objective, and that is business-approved availability.

A

So there's this concept that 100 percent reliability is actually unsustainable, unnecessary, and also extremely expensive. Even if we talk about not 100 percent but five nines, which was everybody's holy grail, that's about 5.26 minutes a year of allowed downtime. That's all you can have with five nines. And the major question is: will your users even know that you're that available? The resounding answer to that is no.

A

They will not, because the internet service provider's base error rate is up to one percent. You can be available at four nines, and the rest of it will just drown in the network errors of the ISP.
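
A rough way to see why the extra nines get lost is to multiply the service's success rate by the network's. The sketch below is a simplification under stated assumptions: failures are treated as independent, every request crosses exactly one network path, and the one percent figure is the upper bound mentioned above. The function name is made up for this example.

```python
def perceived_success_rate(service: float, network: float) -> float:
    """Fraction of requests that succeed end to end, assuming independent failures."""
    return service * network


if __name__ == "__main__":
    network = 0.99  # ISP path with a 1 percent base error rate (assumed worst case)
    for label, service in [("four nines", 0.9999), ("five nines", 0.99999)]:
        rate = perceived_success_rate(service, network)
        print(f"{label}: users see about {rate:.5%} of requests succeed")
```

Under these assumptions, moving the service from four nines to five changes what the user sees by less than a hundredth of a percentage point.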

A

So you're essentially spending lots of money and effort trying to attain something that's not actually useful to anybody.

A

So SLOs are about aligning incentives between business and engineering by getting people to talk to one another, and getting the business to agree that 100 percent availability is not something we're actually going for. And with the SLO comes the concept of error budgets, and that's the acceptable level of unreliability.

A

So the error budget is one minus the SLO. If you had four nines, that would give you 0.01 percent, which would give you about 13 minutes a quarter.
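
The "one minus the SLO" calculation can be written out as a short Python sketch. The function name is made up for this example, and a 90-day quarter is an assumption.

```python
def error_budget_minutes(slo: float, window_days: int = 90) -> float:
    """Allowed downtime in minutes for an availability SLO over a window of days."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


if __name__ == "__main__":
    # Four nines over an assumed 90-day quarter.
    print(f"{error_budget_minutes(0.9999):.2f} minutes per quarter")
    # Five nines over a 365-day year, for comparison.
    print(f"{error_budget_minutes(0.99999, window_days=365):.2f} minutes per year")
```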

A

Thirteen minutes a quarter is not a lot of downtime, but it's a lot more than five minutes a year, and that gives you some breathing room for the time that you can be down. So error budgets are actually about aligning incentives between dev and ops. If developers are measured on the same SLO that operations people are measured on, then imagine that I have those 13 minutes a quarter, and I'm writing new code and pushing changes, and then it gets to the point where we're at 10 minutes out of 13.

A

So I have three minutes of downtime left, because we had some outages when we pushed new changes. And I want my next promotion, so I want to push that big, big change at the end of the quarter so I can get promoted at the end of the year.

A

So my best interest is to test the hell out of it before I push the ops people to deploy it, because I only have those three minutes left in my error budget for the quarter. So SLOs and error budgets actually help us align speed and reliability in a way that makes everybody more successful.
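
The "10 minutes used out of 13" scenario can be turned into a toy release gate. This is a hypothetical sketch only: the `ErrorBudget` class, the `release_decision` function, the 25 percent threshold, and the three outcomes are invented for illustration and are not a policy described in the talk.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    total_minutes: float     # budget for the window, e.g. about 13 minutes a quarter
    consumed_minutes: float  # downtime already spent in this window

    @property
    def remaining_minutes(self) -> float:
        return max(self.total_minutes - self.consumed_minutes, 0.0)


def release_decision(budget: ErrorBudget, cautious_below: float = 0.25) -> str:
    """Hypothetical policy: freeze when the budget is gone, test harder when it is low."""
    if budget.remaining_minutes == 0:
        return "freeze"
    if budget.remaining_minutes / budget.total_minutes < cautious_below:
        return "extra-testing"
    return "ship"


if __name__ == "__main__":
    quarter = ErrorBudget(total_minutes=13, consumed_minutes=10)
    print(quarter.remaining_minutes, release_decision(quarter))
```

The point of a shared gate like this is that dev and ops read the same number, so the person pushing the big change has a direct reason to test it first.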

A

So I'm not going to dive into all of this, but another thing that's important to SRE is monitoring. If you don't know whether your service is up or down, then none of this matters, because you can't actually measure how many nines you're talking about, or anything else.

A

Of course, observability is another related concept, and that again is about how much you know about how your services are doing. This next one is important to me, and to some other people I know who have carried a pager in their life.

A

You need a good signal-to-noise ratio, because if you're paging people about every single thing that's not important, they're going to stop responding to pages. And if you're not paging people when their help is actually required, that's also a problem. We could also dive into who should carry a pager, but we won't. Anyway, I do want to say: when you talk about SRE and automation, it always sounds like automation is going to solve everybody's problems, so there is a little bit of caution in here.

A

Automation can also be dangerous. It's a really good way to make errors rapidly and at scale. You can take down the entire AWS infrastructure with one failed line of script.

A

Then the second part of it is that automation drift starts immediately. You write a service, you write automation for the service, and then you update the service, so you have to update the automation. You immediately start accumulating differences between the automation and the actual services it runs. Automating one-offs is also inefficient.

A

I could spend six hours automating a task that actually takes me six minutes to do manually, and if that was a one-off and it's never going to happen again, then I just wasted time. And then, very importantly, all systems are socio-technical. The goal of this is never to automate humans completely out of a system. We make this error all the time. We're like, "Oh, we're going to automate all the things, because humans are the problem."

A

I mean, humans are the problem a lot of the time, but they're also the solution, because the second law of thermodynamics states that the universe goes towards chaos. All systems left unsupervised tend toward chaos, and entropy always wins in the end, so you need a human to maintain order.
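
The earlier six-hours-versus-six-minutes example comes down to a break-even count. The sketch below is a minimal illustration with made-up names; it deliberately ignores the ongoing maintenance cost that automation drift adds, which would push the break-even point further out.

```python
import math


def break_even_runs(automation_minutes: float,
                    manual_minutes_per_run: float,
                    automated_minutes_per_run: float = 0.0) -> int:
    """Number of runs after which automating a task has paid for itself."""
    saved_per_run = manual_minutes_per_run - automated_minutes_per_run
    if saved_per_run <= 0:
        raise ValueError("automation never pays off if it saves no time per run")
    return math.ceil(automation_minutes / saved_per_run)


if __name__ == "__main__":
    # Six hours of automation work versus a six-minute manual task.
    print(break_even_runs(automation_minutes=6 * 60, manual_minutes_per_run=6))
```

With these numbers the automation pays off only after the task has been run dozens of times, which is why a true one-off is usually better done by hand.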

A

So let's talk about what the future is. Clayton said that he doesn't know what the future is. I don't know either, but I think there's a certain level of goals that we all have. We are kind of striving towards the same thing. The future is already here, it's just not evenly distributed.

A

If we talk about five nines and all the fancy automation things, there are companies that are running at close to five nines, and there are companies that are like, "Well, I had to take my service down for two days just to update, because we had merge hell and we don't actually test anything before it gets to production." And that's what people live with.

A

People also have something like 70 years of what we call legacy code, and that's the actual thing that makes them money, and they have to run the business. So I think, and I'm biased because I work for Red Hat and I'm on a managed services team, that the future is managed services. Managed services can be defined in many ways, but it's all about platform as a service.

A

We've been talking about platform as a service for a really long time, and we've wanted platform as a service for 10 years, or probably 20. All we want is to get to the point where we can just run our applications. There were many attempts to implement a PaaS, and some of them were more successful than others.

A

The problem is that PaaS really only works as long as your environment is homogeneous, and no one's environment is homogeneous. If you have a big enough company, you don't have a homogeneous environment. You're probably running on three clouds and a data center.

A

And then, I don't know, some spreadsheet somewhere runs in Excel on someone's laptop. It just happens. And we know that effective automation requires consistent APIs, and we know that every wave of automation enables the next wave of automation, which is why I'm happy that I'm at KubeCon, because I think that Kubernetes is potentially something that will allow us to proceed to the future and have that consistent API across different systems and different deployments. And 85 percent of global IT leaders agree that Kubernetes is the key to cloud-native application strategies.

A

I don't know about all application strategies, but, you know, cloud-native application strategies. So the point is, everybody wants to have a piece of Kubernetes, which is cool. The other thing is that we all have open source now. Open source won, and it provides us with a way of setting up a standard and letting people share knowledge.

A

We can share what we have in common and work together to define what that consistent API looks like. But the problem with open source (and yes, I think probably everyone is going to have this slide in their presentation) is that you have this proliferation of services and tools and all the things. And that's a real-world picture of someone trying to run Kubernetes in production.

A

So that's what it usually ends up like. But you do have an advantage today compared to just a few years ago.

A

If you want to get out of the data center management business, you can go to the cloud, and if you want to get out of the Kubernetes management business, you can go to OpenShift. Again, like I said, I'm biased. I'm on the team that works on the services called Red Hat Cloud Services, and we run on top of different clouds. You can pick your favorite cloud and run your managed OpenShift on it. OpenShift is kind of an opinionated, turnkey way to get all the bells and whistles that you need inside your Kubernetes.

A

So you don't have to browse that CNCF slide and identify which project is maintained by a single maintainer on weekends, and end up depending for all your security on something that Joe is maintaining in his garage when he has free time. And there's the whole thing where we actually run SRE for the folks who rely on these Red Hat Cloud Services on top of different clouds, which is an interesting problem to solve, because we don't own the infrastructure. We are running SRE on top of infrastructure we don't own, which is exactly the same problem that every company in the world that's not Google, Microsoft, or Amazon struggles with. So we're trying to solve it for other people, which is cool.

A

You know, at Red Hat we also went through a journey. When we first started offering these services, there was an SLA of two nines, and now it's four, and we're trying to get even better, because continuous improvement is a thing.

A

So if you compare traditional organizations with cloud-native organizations: in a traditional organization, again, we have this proliferation of different infrastructure and this proliferation of different platform services. And again, as you standardize, you enable people to automate this complexity and to standardize on something that they can all share across the board.

A

And so eventually, what you want to get to is infrastructure services run by a cloud provider or somebody else, platform services run by somebody else, and then you only have to worry about the applications that you build.

A

There's this picture, and I like this picture. It comes originally from Hans Moravec, talking about AI taking over the world, which probably will eventually happen, I don't know. Basically, it's a picture of water rising in a landscape, and so AI gradually takes over people's jobs. We're not talking about AI here yet, we're talking about automation, but it's still happening. The API kind of gets higher and higher.

A

So if you are a driver, you probably want to look at different career paths, because self-driving cars will eventually arrive. The goal here is to keep your skills above the API and solve actual smart problems, instead of doing something that's going to be table stakes in a few years.

A

So, to the extent possible, you want to outsource your SRE to your platform provider. Last but not least, I wanted to mention something that Red Hat is working on. First of all, ideas are open source, which is why I learned a bunch of the ideas in these slides from other smart people, and hopefully other smart people learn ideas from me sometimes. And we know that open source won, because it's cool, but we're now facing a slightly different challenge than we did before.

A

In open source, we are always trying to incorporate the knowledge that we learn back into the code base. So, upstream first: we're trying to share. But now that we're moving to everything as SaaS, we're having this problem again where the platform is proprietary, so we're no longer sharing knowledge. We're no longer contributing the knowledge back upstream. And so Red Hat is starting a new initiative (this is in super initial stages) called Operate First.

A

It's a concept of incorporating operational experience back into software development. You can find some of these concepts on the website, operatefirst.cloud. It's an effort to basically get people started with a playbook for learning how to run SRE, and also a playbook for sharing operational knowledge across different clouds, so we can all learn from each other in terms of how we run these services.

A

So that was all I wanted to share with you today. I'm Sasha. You can find me on Twitter; especially follow me if you like cat videos. I'd be happy to continue this conversation because, like I said, I think we all learn from each other all the time.

B