Red Hat OpenShift Transformation | OpenShift Commons, 30 May 2020

From YouTube: OpenShift Commons Briefing: Cloud Native Operating Models with Andrew Clay Shafer (Red Hat)

Description

OpenShift Commons Briefing
Cloud Native Operating Models
Guest Speaker: Andrew Clay Shafer (Red Hat)
Hosted by: Diane Mueller (Red Hat)

A

All right, welcome again to another OpenShift Commons briefing. Today we have Andrew Clay Shafer here again. This is his second round on our Fridays with DTO, and he's going to talk a little bit about cloud native operating models and what that means to him. I kind of love the topic, or rather the title, but don't quite know exactly what he's going to come up with today. So Andrew, take it away, and we'll have live Q&A at the end of this. Yeah.

B

Neither do I. I never know what I'm talking about. If you want to jump in and have me explore or explain anything, you know, it doesn't have to be just me monologuing, if you.

B

Hi everyone. So this is just a quick kind of framing.

B

So, you know, I've had a very privileged career, in a sense, to see some of the things and participate in some of the things that I have for the last decade, and that gave me a certain framing to talk about some of this stuff. I would say that the pattern of my career has really been trying to take the things that I've seen work, put them together, and share them in a way that other people can make them work too.

B

So, going back over the last 10 years, it's really been focused on open source infrastructure and products: back to Puppet, OpenStack, and now more and more Kubernetes. In parallel to that, I was part of organizing DevOpsDays globally, and these communities of practice around the Velocity conference. Those conversations, and being part of those projects, gave me what I think was a pretty unique vantage point to try to help the customers and the communities around these different projects.

B

So the overarching theme of lots of this is: there's the code and there's the git repos, and then there's what do you actually do with that, and how do you make it work? This is me trying to distill some of the thoughts and conversations I've had in the last few months and to at least remove some confusion. I'm not sure I'm going to answer everyone's questions. I mean, in some ways,

B

My hope is that you have the ability to ask better questions at the end of this, more than that you have all the answers. So from there, the thing that I want people to realize, or that I think people actually feel viscerally, is that there's this tension between the old ways of doing things,

B

You know, processes, technology, and sort of the new way of doing things, right? That's a little bit of an oversimplification, but the thing that I would point out, and I think you've seen this if you've tried to do this in a real sense, especially in a large organization, is that if you have adopted something like a container platform, you know, of any kind, and you're managing it with your old processes and, you know, with the old mindsets, then the rate-limiting factor, the upper bound of how successful you can be with that

B

New technology is limited by the fact that you're managing it in these old ways. So the new tools sort of require, or not necessarily require, but they're optimized by, new thoughts and new behaviors. In this old world, we used to manage servers, and we would brag as sysadmins about the uptime on these servers that had been up forever and not restarted. And now we live in a world where the life of a container, by design, might be on the order of seconds.

B

You know, minutes; certainly not days or years like we used to be so proud of. The other thing that's happening is you're getting more and more of this work that is automation enforced by APIs and not done by human toil, and that's the theme I'm going to come back to over and over. The other thing I think is worth pointing out, particularly in enterprises as they adopt

B

These things, is that these patterns of IT, the framing as a cost center and having these processes, are really rooted in supporting business as usual, where IT is a secondary consideration to whatever the core business was. It's not that you necessarily want to forget who you are, but in the new world, IT and technology are much more central. If you look at the market dynamics, the performance of the top end of the market is sort of dominated by the FAANG companies, or the cloud companies, whatever you want to call them, where they're using technology not to support business as usual.

B

But technology is the business, right? So that's a kind of fundamental mind shift as well. So here are some questions and, like I said, I'm not necessarily going to answer questions, but I'm kind of going to answer questions, and we'll have more questions by the end. So this is relevant to this narrative around

B

You know, what we've seen over the last 10 years and what people are trying to do. What I see across the Kubernetes community in general, and definitely in the OpenShift community, is that people are in very different places with respect to how they take that technology and put it into their workflows, into their behaviors, and where they let it change their behaviors or not.

B

And you see this in a lot of cases: they're adopting a new technology, but they have a process that came from how they managed their VMs, which came from how they managed their bare metal, and they haven't really redone or rethought from first principles what promises they're able to keep with those processes versus the opportunity that they're losing by slowing themselves down. So I'm going to give you my definition, and, you know, there's this whole sub-genre of people debating the definition of words. But this is.

B

This is Andrew's version of what DevOps is, and I've given this, or said this, on stage multiple times over the last few years. To me, this is in the framing of "software is eating the world," right? So there's this notion that software is eating the world, and to me that means software is going to optimize everything that it can over

B

Some timeline. And to me, DevOps is really about optimizing the human experience and performance of operating software, and doing it with software, so that's sort of the software eating the world, and then recognizing that these are sociotechnical systems. It's not just the technology by itself; it's for us, by us, and so you're going to do this work with humans. So this is, you know, the last ten years of my career: going around the world and talking about, and helping people try to adopt, these processes and these technologies.

B

So at the same time, in parallel to this, there's this conversation, and I'm going to come back to SRE and DevOps in a few ways over the next few minutes, thirty minutes or whatever. But this is the quote-unquote beginning. This person coined the term SRE at Google, and this is a quote from Benjamin Treynor Sloss: SRE is what happens when a software engineer is tasked with what used to be called operations. And.

A

Just for a second: you say SREs, and some people might not know what it stands for, after

B

All. So site reliability engineering is, quote unquote, the Google implementation of DevOps. There's a free book you can read: if you google "SRE book", you'll get to a link hosted by Google that has all of the text for free. If you like to buy dead trees or whatever, there are hard copies as well. The original SRE book was 2016, and I'm going to talk to some of the points from that book, and then there's an SRE workbook that came out

B

In 2018, I think, and I wrote the foreword for that book.

B

Or rather, one of the forewords; the other one is Mark Burgess's. So the first one is really SRE as Google envisions SRE itself, aspirationally, inside Google, and some of that is a little bit navel-gazing and Google specific. The second, the workbook, is an attempt, and it's mostly Google people, but there's some more collaboration outside Google, to bring SRE into practice in a practical way and share some of those practices. This isn't necessarily all about SRE, but there's going to be some more SRE content, and I'll be talking about some of these models. So why?

B

Why does this even matter? Okay, so there's words: you have DevOps, you have SRE. What does it actually do? It's interesting watching, and this is a common theme with agile, DevOps, transformation, SRE, where you see lots of people adopt the vocabulary and maybe change their titles, but they don't really change their process. So let's try to do better than that.

B

So the why, in my opinion, for both these things, and I would argue DevOps and SRE are essentially the same phenomenon, which is part of the theme that I bring out in the foreword I wrote for that book: what's really happening here is that you have these new models being created as part of the evolutionary pressure to deliver systems that are flexible, adaptable, changeable, and at scale with high levels of reliability, right? So reliability is even in the name of the way Google framed it for themselves.

B

So these models, together with the technology, are what enable all these things that we kind of take for granted, right? There's lots of things that we take for granted. We all walk around with these little supercomputers in our pockets, and we sort of think that, for the most part, Google is going to be available, Gmail is going to be available, Facebook's going to be available.

B

Twitter has a few bad days now and then, but it's been pretty good lately, right? So we take this stuff as the baseline ambient experience that we all have. Then, when we get into enterprise IT, that's not always the case for all of us in terms of what we can deliver from a reliability perspective. Part of my argument here is that there's certainly technology that can help you improve those promises, but there's also this human workflow aspect that has a huge impact on building those reliable, resilient systems. So I'm going to walk through a few things.

B

I'm going to talk about platform patterns, then I'm going to compare and contrast this notion of "you build it, you run it," which is sort of famously the Amazon way of doing this, and which is slightly different from the Google way, but I'm going to show how they're not that different in some ways, and then go back to the more SRE-specific language and practices.

B

So this is me drawing shapes on a slide, and the argument I'm trying to make here, and this is again an oversimplification for the sake of making a point, is that in the traditional IT lifecycle you have this stovepipe infrastructure, purpose-built to deploy a particular app, right? So projects would start with the PO, and then there's some lifecycle to get the hardware in the data center.

B

You know, that might be months. Then at some point there's an operating system, then you start putting it together, and then you eventually get an app, and that's long-lived and tied together, so you kind of have to refresh them together. Then we get a little better as we get to some automation and virtualization. But what you see happening in the cloud native organizations is they collapse a lot of the complexity of the infrastructure layers, and they want to spend more and more of their quote-unquote complexity budget delivering the value of the application.

B

So if you go to a lot of enterprise data centers, or colos, or whatever, you walk around and every rack has a slightly different configuration of gear. If you go to a cloud native, quote-unquote, data center, and there's the open hardware stuff, and then there's the Google stuff, not all of it's open, whatever, but if you get an opportunity to go see one of these data centers, it's really football-field-sized data centers filled with racks and racks of identical gear, because they're really collapsing the complexity of what they have to manage. I

B

Think it's also worth noting here: people say things about the ratio of machines managed per sysadmin at one of these companies versus what you see in the enterprise, and a lot of that ratio comes from collapsing that complexity. It's easier to manage a thousand things that are identical than it is to manage, in some cases, ten things that are not identical, right? So a lot of that ratio and that efficiency comes from collapsing that complexity at the bottom of the stack.

B

So this is a pattern you see over and over, and then you look at what comes up the stack into what I'll call the platform services. Every single one of these organizations, and there's a list of more that built various aspects of it, built this sort of self-service provisioning platform for their developers to be able to do work, right? And Google's been very public about how they did that and what they did.

B

Amazon hasn't necessarily been public about that, but they made some aspects of what they learned and built very public by launching EC2 in 2006. And then Google, and obviously everyone's trying to play the cloud provider game now, taking all those lessons from that. So everyone built these things from first principles, in slightly different ways, but they built them because they had to. There was no community project that was helping solve these things.

B

But if you go look at what Netflix built, circa, I'd say, 2010-ish, it basically looks like Kubernetes, but on top of VMs, on top of Amazon. So you have something where you push a thing; they basically baked images, just like we bake containers; theirs were AMIs. They had, you know, these

B

These Java projects. You can walk through the open source projects from Netflix, and you can see the routing and the log aggregation and all these pieces that were the Netflix, Java-specific way of doing that, and now you can map a lot of those same capabilities straight to CNCF projects. So we don't all need to rebuild these things from scratch.

B

This, I'm just going to read.

A

Go back to that slide just for a second, because it's interesting: around 2010 they did come out with platforms there, but just prior to that was when, like, a thousand platform-as-a-service offerings bloomed to use their infrastructure. So it was almost like they saw the need for a platform from the thousands of small platform-as-a-service offerings that were there, and that kind of drove them to do some uniformity there. I.

B

Mean, I would frame it slightly differently: Netflix was doing those platform-as-a-service things before there were public platforms as a service, right? Yeah.

A

We saw, I don't know, there was a platform as a service for Perl, for whatever. You know, everybody had their own platform as a service, and what Amazon and Google really did was kind of unify that and make it available as a product on their platform.

B

Well, I think the first thing they did, because a lot of the platform-as-a-service, and this is a deeper philosophical thing: a lot of the platform-as-a-service offerings failed, right? Yeah. Google's first foray into cloud as a service offering was App Engine, which was a platform as a service that didn't appeal to a lot of people because they had made it so Google specific.

B

You had to basically remap the concepts you were used to onto, you know, Bigtable or whatever the Google version of it was, and they could keep all these promises about scale, but it wasn't necessarily the paradigm people wanted to use. So there's another hour-long conversation about the evolution.

A

Yeah, someday we'll tease that out, because the, like.

B

The insight that Amazon had is they actually went a step down, right? They gave you, okay, here, you can basically run an OS. It's like the first version of

B

EC2 was: here's three different things. You could have any VM you want as long as it's black. They only had three different sizes. And the things that came before that are also interesting, as the first cloud service from Amazon was the queue, SQS.

B

Right, and then S3. So anyway, there's a whole other hour of talking about the evolution of this stuff.

B

So this, I'm just going to read out loud, and there's a reveal, but all this stuff sounds great, right? Remove friction from product development. High trust, low process, no handoffs between teams. Don't do your own undifferentiated heavy lifting. Use simple patterns automated by tooling. Self-service cloud makes impossible things instant. So these are great words; sounds great to me. I did not write them. These are actually stolen, word for word, from a conference presentation on the Netflix lessons learned by Adrian Cockcroft, when he used to work for Netflix, right?

B

So to me, when you're talking about something like OpenShift and the platform and the goals that you have as an organization, it should map more or less to this, as much as you can, right? Obviously we're all in different parts of that journey, but I don't think a lot of organizations necessarily have this as a North Star, or at least their behaviors don't indicate that. So the more that we can help each other get

B

There, then I think the better off everyone's going to be, because I like nice things, don't you? I might have made this point earlier, but this cloud conversation evolved from the lessons learned building and operating these services, and that's key here. So these are services: software services, platform services, infrastructure services. Now, the bad thing is you actually have to operate those. They

B

Don't operate themselves, right? And this is where some of the modeling, or the operating models, come in, because you can make different choices about who's accountable for each one of these things in terms of the operations.

B

So what is operations? We keep saying this word, and there's this whole body of operations that means something in a business context that's totally different from what I'm talking about today. For me, operations is really about the system operations: building this kind of technical infrastructure and delivering things that way. So this is DevOpsDays, this is the Velocity conference, this is the SRE book; all these things are part of it, right? So it's metrics, which are pretty much key to understanding

B

What's going on. Now you have some stuff you can, hopefully, use to determine whether things are good or bad, and when things are bad, you hopefully alert people when things need to change. You hopefully aren't doing everything manually; you've got some automation there. There's a lot of what I consider operations that really comes down to having a mental understanding of the system and getting into the middle of it

B

When things are going wrong, and troubleshooting and doing the right thing, and then hopefully you learn something from that and can make better changes, better automation, better monitoring for the next time. Again, this could be its own, like, 12-hour lecture series about each one of these topics.

B

Basically, this has been the focus of DevOpsDays conversations for the last ten years, and there's lots of meaningful stuff available for you to go take in on that, but these are kind of a baseline set of capabilities. So this is a slide from 2007, and I've used it in a lot of conversations and a lot of presentations.

B

It was made by one of my friends who at the time worked at Amazon, and he was talking about, so this is sort of the golden age of Puppet, coming into the beginning of the DevOps conversations. You have traditional operations on one side and the new, quote-unquote, secret-sauce operations on the other side, and the argument here is that the colors on the graph represent, quote-unquote, toil, so the humans doing work.

B

So the number of hours of work to maintain a system, and then the x-axis represents the scale, the system scaling up, so you're adding servers to the system. Those numbers seem laughable now, like, oh my gosh, 20 servers, but at the time it seemed like a big deal.

B

So the argument here that Jesse is making, and you can go read this, archived from 2007; I wrote another kind of follow-up to it in 2010 revisiting it. The short version is that there's a new way of doing things where, if you put the effort into the design, you get this different curve for the amount of human toil that's required to manage those systems as they scale. And this is 2007.

B

So in 2020 we should be able to compress that curve even more, given the platforms and the tools that we have available to us today. But the thing I want to draw out here, as we go into the rest of it, is this notion that operations is the secret sauce and can be an advantage. I would argue this is the defining advantage of the cloud natives: their operational excellence. So this is 2006, and this is a pretty famous interview. I'll just read it as well.

B

"The traditional model is that you take your software to the wall that separates development and operations, and throw it over and then forget about it. Not at Amazon. You build it, you run it. This brings developers into contact with the day-to-day operation of their software. It also brings them into day-to-day contact with the customer. This customer feedback loop is essential for improving the quality of the service." So this is an interesting quote in time. This is, you know,

B

The year EC2 launched, basically, 2006, and this is three years before DevOps is a word, but it sounds suspiciously like a lot of the conversations people had in the DevOps community. Just to give a shout-out to what I mean when I talk about DevOps, and, you know, if you google, there's hours of me saying things about these words, but you basically have this community of practice that involves having conversations, and these are the quote-unquote elements, or what have you, of these DevOps conversations.

B

This is from a blog post that John and Damon wrote after the first DevOpsDays in the US. They identified culture, automation, metrics, and sharing, and Jez, very quickly after that, added lean, with this notion of continuous improvement, kaizen. There's no shortage of DevOps content online, but this is sort of a framing for some of this other stuff we're talking about with the capabilities. So, going back to this notion of you build it, you run it.

B

What Werner is saying when he says "you build it, you run it" is that the software teams are not building up all of these other services. Those exist for them inside of Amazon. In 2006, for a developer or development team, getting access to provision infrastructure was an API; provisioning a database was an API, right? So you've got all these built-in platform and infrastructure services available to the developer.

B

The developers are not building those, they're not running those; they have insight into them. But what Werner actually means, for the quotable quote of the two-pizza team running their software, is that they run their software, and then they have all these other things that are taken care of for them by the teams with those other responsibilities. So that's something that I think sometimes gets lost in translation,

B

Where you see groups of people who are like, oh, you build it, you run it, so, oh well, you've got to install the OS, and you're giving developers, who may not have that kind of context and expertise, a bunch of things that they're not necessarily prepared to do well. So there is some value to that sort of specialization and stratification. So this is Google.

B

SRE is not really part of the lexicon until 2016, so this is ten years after that Werner quote. I already kind of said this earlier, but this is essentially Google's DevOps implementation, and one of the reasons I share this is that if you go through and read this book, they pretty much check off all these boxes. You can read it for free; search "SRE book". And this is my recommendation for everyone: good DevOps copy, great DevOps steal.

B

So wherever you find good ideas, you should take full advantage of them. For the next section, I'm going to talk in more specifics about SRE in practice, and this is straight from the book. So this is Dickerson's hierarchy of reliability from the SRE book, and here you can see the foundation of reliability from a Google perspective is monitoring, right? So then you have monitoring.

B

You can figure out a little bit more about what's going on. Now you know something's wrong, so then you respond to incidents. Okay, we respond to those incidents, we learn some things, we do some analysis that goes back into it, and then at the very top you get up to this notion of the product, like, what.

B

Why does this infrastructure even exist? I'm not going to go through it all; I mean, the SRE book is a 500-page book. So for now we'll just keep going, but I'm going to come back to this notion of monitoring as the central thing. So, if anyone hasn't read the Borg paper: I think the Borg paper was, like, 2015. They published this paper about Borg, and it talks about the evolution of Borg, and it talks about Kubernetes and some of this stuff.

B

I also think it's kind of fun to point out and reflect on that in the 2009-2010 timeframe, if you had a conversation with someone who worked at Google and you tried to get them to talk about Borg, they would stop talking to you, right? So there's, like, this.

B

This shift in the understanding of what is a competitive advantage, and that's a fundamental shift. You see Kubernetes and the ecosystem that was built around it as a reflection of them reframing some of those things that they thought were secrets. So the Borg paper has this gold nugget that I think everyone misses; they get focused on container scheduling and fancy algorithms. So this is straight from that paper.

B

"Almost every task run under Borg contains a built-in HTTP server that publishes information about the health of the task and thousands of performance metrics." And I have a standing wager, and I don't think I'll ever lose this, that you will get more operational benefit from building instrumentation and observability into your software, into your applications, into your services,

B

Than you will from navel-gazing on optimizing your container scheduling infrastructure, right? So.
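As a rough illustration of the "built-in HTTP server" idea from the Borg paper quote, here is a minimal sketch in Python using only the standard library. The Borg paper does not specify endpoint paths or a payload format, so the `/healthz` and `/metrics` paths, the JSON payloads, and every name below (`METRICS`, `snapshot`, `IntrospectionHandler`, `start_introspection_server`) are assumptions made for illustration, not Google's implementation.

```python
import json
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

START_TIME = time.monotonic()
# Hypothetical in-process counters; a real service would update these as it works.
METRICS = {"requests_total": 0, "errors_total": 0}
METRICS_LOCK = threading.Lock()


def snapshot():
    """Return a copy of the counters plus uptime, safe to serialize."""
    with METRICS_LOCK:
        data = dict(METRICS)
    data["uptime_seconds"] = round(time.monotonic() - START_TIME, 3)
    return data


class IntrospectionHandler(BaseHTTPRequestHandler):
    """Answers health and metrics queries (paths are assumed, not from the paper)."""

    def do_GET(self):
        if self.path == "/healthz":
            self._reply(200, {"status": "ok"})
        elif self.path == "/metrics":
            self._reply(200, snapshot())
        else:
            self._reply(404, {"error": "not found"})

    def _reply(self, code, payload):
        body = json.dumps(payload).encode("utf-8")
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet; a real service would log requests


def start_introspection_server(port=0):
    """Serve health and metrics on a background thread; port 0 picks a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), IntrospectionHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling `start_introspection_server()` next to the application's main loop would let an operator, or a platform health check, poll the task without any knowledge of its internals, which is the operational benefit the wager is about.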

A

Telemetry is everything. Monitoring

B

Is the foundation of reliability from Google's perspective, right? People miss this, but whatever you can do: monitor, monitor, monitor. So this is all straight from the book, and we don't necessarily need to go through it in depth, but there's service level terminology you have to build up to get to talking about SLOs, and to me SLOs are the defining feature of SRE, so it's worth building up. So you have service level indicators, which is basically: now you have some monitoring.

B

You can look at your stuff and say, okay, here's the service level. Then you set service level objectives, which we'll talk about a bit more, and those are not to be confused with service level agreements, which tend to be contractual and maybe imply things about money and that kind of stuff. So a service level objective, and I'll talk a bit more about it, is basically this three-way contract.

B

So, just for the sake of thinking about it, and not necessarily to be exhaustive here: every service is different, and in the book there's a thoughtful discussion about what types of service level indicators and service level objectives might be appropriate for the types of things you're building, right? User-facing systems are slightly different than storage systems, and so it's about finding a way to map

B

What you're doing, not like a cookie-cutter, paint-by-the-dots thing, but being thoughtful about what a service level should mean for the particular system, and then that drives a bunch of decisions about monitoring. This is also straight from the SRE book: these are the golden signals.

B

The four golden signals, according to Google's SRE book, are latency, traffic, errors, and saturation. So if you are not monitoring those things right now, this is maybe a good opportunity to steal a good idea from someone else, and then think very hard: if you're not monitoring this stuff now, why not? And if you aren't, then what would it take to get this kind of information and start to think about the meaning of each of these for your particular context?
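To make the four golden signals concrete, here is a small Python sketch that summarizes one monitoring window. The SRE book names the signals but does not prescribe these formulas, so the choices here (99th-percentile latency by nearest rank, requests per second for traffic, failed fraction for errors, in-flight work over capacity for saturation) and the names `Request` and `golden_signals` are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One observed request (hypothetical record shape)."""
    latency_ms: float
    ok: bool


def golden_signals(requests, window_seconds, in_flight, capacity):
    """Summarize latency, traffic, errors, and saturation for one window."""
    if not requests:
        raise ValueError("need at least one request in the window")
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank 99th percentile: rank = ceil(0.99 * n), computed with integers.
    rank = max(1, -(-99 * len(latencies) // 100))
    return {
        "latency_p99_ms": latencies[rank - 1],
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": sum(1 for r in requests if not r.ok) / len(requests),
        "saturation": in_flight / capacity,
    }
```

What counts as "capacity" or as a failed request is exactly the context-specific thinking the talk asks for; a storage system and a user-facing API would likely fill in these four numbers differently.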

B

Now that you have SLIs and you can measure these things, you move on to this notion of an SLO, the service level objective. To me, the defining quality of SRE is really centered on SLOs and this contract. So you say, we want this many nines, or this many whatever, for these particular indicators, and that establishes an error budget.
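The step from "this many nines" to an error budget is simple arithmetic: the budget is whatever fraction the SLO leaves over. A minimal Python sketch, assuming an availability SLO expressed as a fraction and a 30-day window (both the window length and the function name `error_budget_minutes` are assumptions, not something stated in the talk):

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of allowed unavailability for an availability SLO over a window."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    # The budget is the complement of the SLO, scaled to the window length.
    return (1 - slo) * window_days * 24 * 60


if __name__ == "__main__":
    for slo in (0.99, 0.999, 0.9999):
        print(f"{slo:.2%} over 30 days -> {error_budget_minutes(slo):.2f} minutes")
```

Under these assumptions, three nines over 30 days leaves roughly 43 minutes of budget, and each additional nine cuts the budget by a factor of ten.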

B

You don't necessarily want to have a single dimension for an SLO on a service, but you also don't want to make it too complicated, and this last point here is worth pointing out: progress is more important than perfection, right? So what do we have measured? What are we looking at today? What can we do to improve that system and make it better tomorrow? That's more important than getting it perfect.

B

So this SLO is really a three-way contract between the developers, the business, and operations. The business is saying reliability is important to us: if this thing is not available, then it's not delivering value. Developers are pushing the code, and then operations, or SRE, is responsible for that reliability. And what that establishes is this notion of a service level objective, which gives you an error budget, and then, in the context of the error budget,

B

The idea, at least from the aspirational, perfected version at Google, is that you can do things with that error budget. I think it's worth pointing out that 100% is not realistic; I'll argue it's actually impossible as you get to these nines. So now you have an error budget established, and you can talk about this acceptable level of unreliability, right? For some things that might be minutes or seconds, and it especially gets interesting

B

When you talk about building these services that can operate with continuous partial failure, right? So you have some isolation, some concurrency, so that some fraction of your system can be down. And that dovetails into another kind of interesting sub-genre of DevOps around chaos engineering and injecting failures, and that's all good fun, but not what we're going to talk about today. So now you have an SLO. You have these consequences.

B

So when you're within your error budget, in the perfected version of this, you have this notion of developer self-service access; they can do all this stuff. When you blow your budget, and we'll go to this next slide, then it changes the dynamic of these things. So in the quote-unquote aspirational Google sense of this, when you're within your error budget for reliability, the dev team delivers features.

B

If you blow your error budget, then the dev team's capacity for work is focused on creating more reliability, and this is something lots of orgs balk at. They have a hard time, and there's a bunch of political and organizational reasons

B

Why this is hard. One, you don't have SLIs in the first place in a lot of places, but two, getting this idea that we're not going to work on features because stuff's unreliable just kind of seems to blow people's minds. But this to me is the defining feature of true SRE in practice, at least as it's aspirationally espoused by Google. And then, moving forward.
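The consequence described here, features while the budget holds and reliability work once it is blown, can be sketched as a small decision function. The function name, the return labels, and the use of incident minutes as the unit of budget consumption are assumptions for illustration, not a policy taken from Google.

```python
def budget_status(budget_minutes, incident_minutes):
    """Return (remaining budget, what the dev team should focus on)."""
    consumed = sum(incident_minutes)
    remaining = budget_minutes - consumed
    # While budget remains, the team ships features; once it is spent,
    # capacity shifts to reliability work until the service recovers.
    focus = "features" if remaining > 0 else "reliability"
    return remaining, focus


# Hypothetical month with a 43.2-minute budget and two incidents.
remaining, focus = budget_status(43.2, [10, 12.5])
print(f"{remaining:.1f} minutes left, focus on {focus}")
```

The hard part is not the calculation but getting the organization to honor the "reliability" branch when it fires.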

B

So you have this notion of SRE at Google. When you look behind the covers, not every project at Google gets SRE. They actually start out very close and very similar to the model that happens at Amazon, and then they earn the right to have SRE support

B

By, you know, demonstrating their value, and then SRE take over the operational responsibility, the on-call and the troubleshooting, after services have gone through the production reliability review and the application reliability review to kind of retool the architecture to match the promises of the SLO that the SREs are signing up to keep. And so, just to make this point, this is something I think gets lost, because people see SRE and they're like, oh, it's just like traditional operations.

B

You know, we'll just change the name of our sysadmins. And so, sorry: the SRE are not there to take toil away from the software engineers. The SRE are there to drive toil out of the system, and so that's the whole point of, one, the E part of SRE, and two, this reliability assessment to take the burden of it. And at least aspirationally from the book,

B

There's this notion of toil that's being created by a service, and according to the book, if you, as a software engineering team, exceed the SLO error budget too much, then the SRE have the right to push all of the operational burden back onto the software team. So it's like, you can't get your stuff together, you're causing me a lot of problems, and it works like, OK,

B

Now it's all your problem until you get back into, you know... if you need to go outside to use the bathroom, you can't be a puppy, right? We've got to train you to do this right. So there's a little bit of a power dynamic where the SRE can push the operational burden back on the software engineers. I think it's also worth pointing out, and I had a lot of conversations,

B

Philosophical conversations with people at Google about this, but SRE are effectively the architects of Google's platform, those platform services and those data services in particular. In a sense they were also essentially product managers there. So this is straight from the book as well: "SRE builds framework modules to implement canonical solutions for the concerned production area. As a result, development teams can focus on the business logic, because the framework already takes care of correct infrastructure use."

B

So when you're thinking about adopting a container platform and kind of building up these platform services for your own organization, I think maybe you're not going to adopt the SRE model wholesale, but think thoughtfully about what promises the services we're building can keep with respect to this infrastructure use as we make them available to our developers. To go back to the lessons learned from Netflix, we really want to unlock that approach to development. And so we're kind of coming to the end of this.

B

So my kind of advice is: think thoughtfully about these different services that you have to operate. Think thoughtfully about who has the accountability to operate them. Who has the tools to operate them? I really like SLOs. If you think about the way that Google has architected itself and built this up, you have these infrastructure services, each of which has SLOs and keeps promises to these platform services, right? So at the bottom you have the container scheduling, you have Colossus, you have, like,

B

You know, some thoughtful things about how you're going to schedule jobs and store data and the rest of that, and then you build higher-level services on top of that. They also have SLOs, promises kept to the application and the software on top of that, and then last but not least, you have the customer-facing SLOs, because hopefully there's some business at some point. So the advice here is not necessarily, like,

B

You should adopt SRE practices, but be explicit about these models. Develop your own understanding of what you're doing now, and make that explicit in a way that you can evolve it in a meaningful way to something that is quote-unquote better, right? Progress over perfection. And realize that everyone's kind of in a different place on this continuum of adoption.
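One way to make the layered model explicit is to check what a higher-level service can realistically promise given the SLOs of the layers beneath it. The sketch below assumes the service needs every dependency to be up and that failures are independent; both are simplifying assumptions, and the layer names and numbers are hypothetical.

```python
import math


def best_case_availability(dependency_availabilities):
    """Upper bound on availability when every dependency must be up.

    Assumes hard serial dependencies with independent failures.
    """
    return math.prod(dependency_availabilities)


# Hypothetical stack: scheduler, storage, and a platform service.
layers = {"scheduler": 0.9995, "storage": 0.9999, "platform": 0.999}
ceiling = best_case_availability(layers.values())
print(f"a service on this stack can promise at most about {ceiling:.4%}")
```

Under these assumptions a customer-facing SLO cannot be tighter than the product of the promises beneath it, which is a reason to write those lower-level promises down.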

B

You know, whether you're talking about SRE, DevOps, or whatever, on the far end of the spectrum there's lots of manual work, there's not very much monitoring, and everything is sort of done through these slow feedback loops and ticket systems. What you want to get to is this enablement: you build the platform to keep promises, with enabling constraints that give you the confidence to allow your developers to have self-service access to these systems, because you can keep promises. And that dovetails into a bunch of interesting conversations about ITIL and these weird misconceptions about segregation of duties, and those are fun conversations to have, but we probably don't have time for that right now. So this is sort of Andrew's simplified version of what you should think about as you're adopting and making things explicit.

B

If you don't have monitoring, if you don't have great monitoring, if you don't think you can kind of think about the four signals or what's appropriate for your services, that's a great place to start investing. As you build monitoring capabilities, now, OK, we no longer have the customers tell us something's wrong; we can detect that things are wrong. How do we respond to that? Be thoughtful about incident response and the way you're going to manage those, and then kind of build yourself up through this pyramid of reliability.

B

So what is DevOps? What is SRE? I promised I would give you more questions, not necessarily more answers. Honestly, who cares? These are just words. What works is a more interesting question, in my opinion, and in particular what works for you. You know, what works today, and what could work better for you tomorrow? And then really, at the end of the day,

B

If it doesn't go to prod, it doesn't matter. Production or it didn't happen. Can you put code into production, and how fast? And then, once you get it there, can you keep it running? So that's some questions for you. Thank you for your time. That was a quick run through some thoughts and conversations I've been having recently about how to optimize, you know, a kind of operational practice and process around your OpenShift investment.

A

That raises probably even more questions than it gives models, which is great. I think your plan for today was to give us some models, but it surfaced some of the conversation and the tensions that you can see inside of organizations that are trying and struggling to adopt these models.

A

There was one diagram you had in there that was the three-way conversation between business, ops, and development. I think there was one visual there, and when I saw that, the other thing was the other component, which you added in later, of the customers. What you often see is this tension inside of development groups and product management groups that are trying to deliver more features

A

You know, versus the stability, and dealing with the tensions of accountability for maintaining and developing stable services, in order to get to that optimal goal of the SREs coming in and taking responsibility for the operation of the software services. That tension, I think, is one of the things that you have to tease out inside of your organization, how you're going to... I don't know whether it's the Pavlovian reward-them-for-good-behaviors kind of thing.

A

But that's, I think, where we see most of the tension: product managers or developers have pressures on them to deliver more features. So...

B

That's what it comes back down to, and really where the organizational kind of misalignment or conflict comes from is that you have executives that have slightly different compensation models, right? So if you can't align that higher-level mission at the top level of an org, the chance that your front-line developers and operators are going to be aligned is zero. So you have to really revisit fundamental assumptions about some of those organizational power dynamics to get to these models, and...

C

People's bonuses and all kinds of things. If the bonuses are tied to releasing features, whether they're reliable or not, they're going to be released. Yep.

B

People don't like it when you mess with their paycheck, right? That's, yeah.

A

Absolutely. I think that the reliability of your service and your software is key, but the tension of delivering, I see it every day, inside of Red Hat and in the companies that we work with, of wanting to deliver more features, more functionality, at higher scale, and that pressure to deliver more but try not to ignore the stability of it. I think that's really it for me.

A

The other thing that was interesting to me early on in the whole slide deck was the artisanal versus industrial conversation, that traditional was artisanal and today is industrial, and then there was another one about delivering the impossible. Adrian, I think it was Adrian's quote there. And the myth, or at least the hope that I have, is that this industrial-strength infrastructure and these new practices will allow us to do the creative things that we want to, right, and empower developers to deliver these new things,

A

These new features, these new services, but there's the complexity of having to understand what's under the hood. So when you see someone now come to the table with a new offering, a new service, they also have to understand what Kubernetes is, right? That's different than just having to be able to build a web application or a service or a database offering. So there's all this extra that you're asking developers to understand, and that's really, I think, another cultural shift that we've seen.

B

In some ways I think that's actually wrong. If your developers need to understand more and more infrastructure to do their work... it's not that you want them to be ignorant of it, but you want your developers to be focused on the creative aspect of their domain and the value they're creating, not sifting through YAML. And so there's definitely an aspect where understanding

B

More of the stack helps you make better or more optimal decisions for the global, but at another level, giving abstractions that hide some of that complexity lets you do things that you never could if everyone's worried about every layer. Yeah.

A

Definitely, though, Diane.

D

We got a question in chat here. Mohamed Mohamud on Facebook asks: how does the nature of the application, monolithic versus service-oriented, impact this process of going towards an SRE/DevOps culture? Any suggestions on moving a 20-year-old monolithic app towards this goal? I mean...

B

I think this is a very interesting question, and a lot of organizations are being kind of forced to ask this. I think that you have to look at what you're trying to accomplish, and I'm not from this school of thought that microservices are always better than monoliths, right? So thinking about what kind of promises you can keep, and why you want to move mindfully to these architectures, is key. Now, when you think about the operating model and these tools and these other capabilities that I talked about for operations,

B

One thing to keep in mind from the very beginning: let's say you have an aspiration of a microservices architecture. When you have microservices, you have more deployments, you have more things that need monitoring, right? So if you have a high fixed cost of deployment, in terms of the work, the automation that's not there, the testing, whatever gives you your confidence in the quote-unquote release, or you have these unmonitored systems, and then you go to a microservice architecture

B

Without that platform support and these operational models being changed, you've actually made more work for yourself, right?

B

So part of the microservice architecture is predicated on having these quote-unquote DevOps capabilities, having these platform services available to you, because if you have that fixed cost of deployment that's still high for each new service that you add, you've actually just buried yourself in toil, right? So getting it to where the fixed cost of a new deployment of a service is essentially negligible is kind of where I would start with moving towards that architecture, and then from there, you know, go through the hierarchy of reliability.
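The point about fixed deployment cost can be made concrete with a little arithmetic. All numbers below (service counts, deploys per month, hours per deploy) are hypothetical and only illustrate how the cost multiplies.

```python
def monthly_deploy_hours(services, deploys_per_service, hours_per_deploy):
    """Total human hours spent on deployments in a month."""
    return services * deploys_per_service * hours_per_deploy


# Hypothetical: a monolith with a manual, 8-hour release process.
monolith = monthly_deploy_hours(1, 4, 8)

# Same manual process after splitting into 30 services.
split_without_automation = monthly_deploy_hours(30, 4, 8)

# Same 30 services once a deploy costs about 15 minutes of attention.
split_with_automation = monthly_deploy_hours(30, 4, 0.25)

print(monolith, split_without_automation, split_with_automation)
```

Splitting the application multiplies whatever the per-deploy cost is, so driving that cost toward zero comes before adding services.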

B

If you don't have good, well-factored monoliths that are monitored in a meaningful way, the chance that you're going to end up with a well-factored microservice architecture that is monitored in a meaningful way is quite low, right? So let's build up that organizational competency, that kind of muscle memory around those things, with the monolith that we have, and then meaningfully take pieces of functionality out of the monolith over time, because I also think that the big-bang rewrite approach to going from monolith to microservices tends to lead to catastrophic failure. So it's...

A

Also, with a monolith, you probably can take a part of it and deploy that first, and figure out the pieces of the monolith that you can break away, and try the new architecture out and deploy. That's, I think...

B

Like a modernization conversation.

B

To me, I kind of break things into a few buckets. If something is not causing operational burden, doesn't need to scale to keep the promises I need for my org, or doesn't have a need to change rapidly, right? So something that needs to stay the same, isn't a problem scaling, and isn't expensive to operate, I'll just leave it alone. Yeah, right.

D

Yeah, there's no reason to move stuff that's working just fine, right? Unless you have some high need to do so for operational reasons. If...

B

I want to create some new features that are going to be, you know, taking advantage of the agile, you know, whatever kind of product development lifecycle,

B

Let's move that into architectures where we can have more rapid feedback cycles with that customer engagement, right? So that's a motivator. If I know I'm having problems keeping the reliability, the scaling, of that particular architecture, let's get to, you know, the new event-driven or whatever your vision is for that architecture to keep those promises. Or if it's expensive for other reasons, you know, expensive in terms of human costs or licensing costs or whatever, that could be a motivator. But if it's not one of those three things, monolith for life, baby.
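The three motivators just listed can be written down as a simple checklist. The `Workload` fields and the wording of the reasons are assumptions made for this sketch; the logic is only the rule of thumb described above.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    needs_rapid_change: bool                   # wants fast customer feedback cycles
    has_scaling_or_reliability_problems: bool  # struggles to keep its promises
    expensive_to_operate: bool                 # human, licensing, or other cost


def modernization_reasons(w):
    """List which of the three motivators apply to a workload."""
    reasons = []
    if w.needs_rapid_change:
        reasons.append("needs faster feedback cycles")
    if w.has_scaling_or_reliability_problems:
        reasons.append("cannot keep reliability or scaling promises")
    if w.expensive_to_operate:
        reasons.append("too expensive to operate")
    return reasons


def recommendation(w):
    # No motivator means leave it alone: "monolith for life".
    return "modernize" if modernization_reasons(w) else "leave alone"
```

For example, a stable, cheap, rarely changed billing system would come back as "leave alone", while a storefront that needs weekly experiments would come back as "modernize".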

A

That's gonna be the new t-shirt: monolith for life.

A

No monoliths? I know. It's interesting, and so this conversation and more conversations like this will keep happening on Fridays at this time. We'll bring more folks from the Office of the CTO, as well as other talking heads and people from this space, to help you all with your transformations. We're really glad that Andrew could join us today and make this happen, and if you want to get a hold of us, it's really easy.

A

You can tweet at him at @littleidea, and we will post this video with his credentials and how to get a hold of him. We are also launching a transformation SIG, so there'll be a landing page soon up on commons.openshift.org, with links to this video and others, as well as a place to sign up for how to join us and get announcements about who's coming on deck

A

Next, and if you have a topic you want to hear about, let us know we will try and find someone to talk about it or make you talk about it, which is even more fun.

A

So definitely do that. So thanks again, Andrew, for joining us today.

A

Just a pleasure, and lots of food for thought there. Take care and have

D

A great thing, solder.

A

All right, stay safe, everybody. Cheers.