From YouTube: Chaos Engineering w GlooShot - Scott Cranton, Solo.io
Description
Scott Cranton presents on how distributed microservices introduce new challenges: failure modes are harder to anticipate and resolve. In this session, he presents a "Chaos Debugging" framework enabled by three open source projects (GlooShot, Squash, and Loop) to help increase microservices' "immunity" to issues.
Event page: https://www.meetup.com/New-York-Kubernetes-Meetup/events/262237646/
Well, I'll apologize up front: I've got a chest cold, so if I start hacking up a lung or anything, I apologize. It's all part of the show.
I joined Solo about six months ago. Before that, and I guess relevant to this group, I was at Red Hat, where I led the North American solution architects for all of OpenShift, which was Red Hat's Kubernetes distribution. So I dealt with that for a long time, and I've been talking to people and customers about Kubernetes for a very long time.
A lot of that was about Istio, and what I'll talk about today, I guess, is almost sort of the post-Istio world: once you have something like Istio, or any kind of a service mesh, what are the interesting things you can do with it? That's really what I'm going to talk about today.
And on the topic of giveaway stuff: Christian Posta also works at Solo. He's been a big speaker on Istio for a couple of years now, and he's got a book, Istio in Action, that we're raffling off if you fill out that survey. Somebody put down "734" and "63rd"; I don't know if those are favorite streets, restaurants, or bars, but anyway.
So, just a quick thing about Solo to put a little bit of context around this. Solo has, I think at various points, been sort of the number two or number three committer to Envoy, which is the underlying data plane for Istio. Has anybody heard of Istio? Okay, everyone. What about Envoy? Okay, pretty much everyone else.
So we've been dealing with that for a while. We've been building products like API gateways, which are essentially a mini control plane on Envoy, dealing with the traffic-shifting part of it. That helps a lot of people who are migrating into the Kubernetes world from standalone deployments, because we can route traffic around.
Then there's the idea of multi-mesh. I know: why would you ever do that? But people have situations like a lot of load up on Amazon, where AWS App Mesh gives a deep integration with the Amazon infrastructure, but they might have something that also needs Istio. How do they deal with that? Or they might have multiple clusters with multiple service meshes deployed, and they need federated policy management across them. We also help with that.
What this talk is going to deal with is: once you've got that kind of world, there are all sorts of interesting things you can do. If you've got Istio, or any kind of a service mesh (I'll talk about service meshes in general) with sidecar proxies, suddenly you've got a lot of observability into the network because of all the sidecars.
You can do very interesting things with that, and it can actually help you a lot with making your larger-scale microservice environments more resilient. That's really what I'm going to talk about today: the problem of when you've got a microservice environment. I've taken my nice, happy monolithic app, a single process, and I've now broken it into tens to hundreds of pieces.
It's a little more challenging in terms of how I have to deal with that, because you've taken what could still be a big app, but a simple one from a deployment-topology perspective, and you've turned it into a full-scale distributed application.
So now you could have lots and lots of things, and when you're trying to actually debug this or understand it, you've got lots of interactions. And if people are doing a DevOps methodology, each microservice is moving at its own speed. The things that you're connecting to are changing at speeds that maybe you're not aware of; they're making changes you're not fully aware of, but you're still dependent on them. So how do you manage a system like that?
And I love this quote from Leslie Lamport. I believe he actually went to high school in the Bronx, down here, though he went up to college in Boston. He talks about how a distributed system is one in which you can't get your work done because some machine you don't even know about has failed. And that's basically what a distributed application environment is: something, somewhere, is always broken.
Think of Netflix's Chaos Monkey, which has been talked about a lot. Netflix was seeing cases where random services would disappear, and they had to make their system deal with it, so they built an application to randomly turn services off. Then they built Chaos Kong, which would randomly turn off regions; it would simulate entire region failures. Why would people do that? Well, if you look at these pictures, and I know these have been talked about a lot, the connectivity graph between any service and any other service is mind-blowing. There's no way any person could know what services you're talking to and how they interact, and that's what I mean by the scale of this problem. And, to inject a little humor: we've taken our monolith and turned it into a murder mystery. Who did it? What broke? I don't know. It kind of becomes a fun problem, except when you're debugging it at 2:00 in the morning on a Saturday; then it's not so fun.
The other part of this challenge is that a lot of our tooling has not caught up to this. As computer scientists and engineers, we have a lot of great tooling that can deal with single-process debugging: we can understand that stack, we can debug into it. Or we have language-specific libraries; Netflix did a lot of this, and Go's got things like Go Micro and others. So there are language-specific libraries that can help with service discovery, retry logic, and so on.
Things like Hystrix performing circuit breaking. That's great, but what happens if I've got some stuff in Go and some stuff in JavaScript or TypeScript? Okay, so I introduce some language-specific libraries to help with retries and such. Well, now you're debugging the fact that those libraries do it slightly differently; the circuit breakers trip differently. Who's dealt with that problem? I know I've hit it, and a few people here have too. So you know what I'm talking about: the tools are a problem, the language-specific stuff is a problem, and then there's code modification.
You're changing your code to deal with the fact that something somebody else just changed broke, and if you're trying to deploy quickly, you're never getting ahead of it. So it's a problem. Okay, so I've now made everyone depressed: there's no solution, we should all just start drinking more beer. Or, you know, there are ways to deal with this.
I don't have enough time to talk about chaos engineering in a lot of detail, so I'm going to use the analogy of vaccinations. If you think about a vaccination, the idea is that I am intentionally injecting a little bit of a virus into a person; a production system, my production system, right, this is me. I'm going to inject it with the idea that my body is going to generate antibodies that help me fight it off when a real virus comes along.
So I'm going to intentionally inject a failure, a virus, into a production system to help make it stronger. That's really the core of what chaos engineering is, and that's what we're going to take to our distributed application: we need to intentionally inject small faults in production so that we can see how the system behaves. Does it degrade gracefully? If I lose a back-end system that I depend on, okay, I can't get whatever that capability is, but is it going to take down the whole system? That would be bad.
So how do I manage that? It's the idea of intentionally breaking things, and in short, there are a few principles. Netflix has a nice short O'Reilly book on chaos engineering, and there's a bunch of material out there, but there are two principles I want to leave you with and talk about. One: you want to automate these experiments in production.
It's great, and you should do all of the testing you'd do on a normal system, and all your staging. But like those graphs showed, you don't know what the interactions with all of those other systems are until you get the system into production. It's only going to happen in production, so you need to get systems, and a culture, where you're actually going to test these kinds of failure scenarios in production, automating that and running it continuously.
The second thing you really need to think about is how to minimize the blast radius. If you inject a fault into your system and it fails, okay, our experiment "succeeded": it failed. But you don't want it to take the whole application down. You want it controlled, so that you can experiment safely. Now I'll segue for a second: who's seen the series Chernobyl that's out now? I'll use a current topic.
I'll move away from my vaccination analogy. Recent analysis of Chernobyl suggests that the immediate source of the failure was steam pipes bursting because of a coolant failure. They were running experiments to increase the power loads, and while they were doing that, automated systems were fighting against them, starting to put the control rods in place. So they were trying to increase the power load.
That was triggering automated systems to bring it back down, so they had this fight going on between systems that weren't communicating. A third party saw these behaviors, said "something is going really bad in our nuclear reactor," and triggered an entire shutdown, which caused the cooling system to fail. We all know what happened from there. It's a horrible example of bad blast-radius control: an experiment went bad and something horrible happened.
So when you're thinking about chaos engineering: one, you want to think about failure scenarios you've either seen in the past or want to experiment with, and ask how your system is going to handle those failures; and two, you want to make sure that when you run the experiments, they're not going to take everything down. Hopefully no one here is doing nuclear engineering and running a plant.
Please don't do this there. But if you're going to do it, make sure you don't take down your whole application; make sure your tests have ways you can control that. Very important. The other thing I'll talk about, bringing it back to how service meshes fit into all of this: I introduced the idea that this is sort of the post-service-mesh world.
You can do these interesting things. Service meshes, in short (I'm not going to do a whole talk on them, though I'd love to talk about service meshes during the social hour): the idea, with Envoy in particular as the core technology, is a data plane. It basically took a lot of that retry logic, circuit breaking, and rate limiting and moved it to the networking layer, or what I'll call the application networking tier. It moved it outside of your code.
All of that common logic from things like Netflix OSS or Go Micro moved outside your code, which lets you deal with any language: everything is just interacting at the network tier at this point. It also gave us a lot of visibility into the network, in terms of who's talking to whom. And because the mesh is proxying all of the network communication, you can actually introduce failures into the network without tampering with the back-end service itself; you're catching it at the network level. That's one of the interesting things service meshes enable, among many others, and relative to chaos engineering it lets you do some very interesting things. And, kind of the opposite side of all this: as a service developer, I don't have to think about it myself. All kinds of good stuff.
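To make that concrete: this is roughly what network-level fault injection looks like in plain Istio, the layer that tools like GlooShot drive for you. A minimal sketch, not from the talk; the service name and namespace are illustrative.

```yaml
# Plain-Istio fault injection: abort requests to the ratings service
# with an HTTP 500, without touching the service's code or deployment.
# A minimal sketch; names and namespace are illustrative.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: ratings-abort
  namespace: default
spec:
  hosts:
    - ratings
  http:
    - fault:
        abort:
          httpStatus: 500     # the sidecar answers 500 instead of forwarding
          percentage:
            value: 100        # apply to 100% of requests
      route:
        - destination:
            host: ratings
```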
So, GlooShot is an open source project; anybody can try it, at glooshot.solo.io. Don't try it right now, because I need the network for a second, but you can try it in a few minutes. It allows you to do those controlled experiments. It assumes there's a service mesh in place that it can take advantage of, and it uses the new Service Mesh Interface (SMI) standard API.
That's an abstraction we helped pioneer, and because it's there, we can work with any of the meshes: any that do traffic shifting and have observability capabilities, we can deal with.
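For reference, since SMI just came up: this is roughly what SMI's traffic-shifting resource looks like. A sketch only; the weight syntax has changed across SMI API versions, and the service names are illustrative.

```yaml
# Rough sketch of an SMI TrafficSplit: shift traffic for one logical
# service between versioned backends. Weight syntax varies across SMI
# versions; services named here are illustrative.
apiVersion: split.smi-spec.io/v1alpha1
kind: TrafficSplit
metadata:
  name: reviews-split
spec:
  service: reviews            # the "apex" service clients call
  backends:
    - service: reviews-v1
      weight: 90
    - service: reviews-v3
      weight: 10
```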
So we can run these experiments, and you can set what I'll call stop triggers: if something bad is happening in your application, stop the experiment in an automated way. You can define metrics against Prometheus, so you can say: hey, if my application starts going really bad, stop whatever faults I injected, and hopefully your system knows how to recover from that.
That's kind of where we're at at this point, and this is part of the set of open source projects that we've done. To give you a sense of it: we've got GlooShot, and then there's Loop. Because again we've got observability in the network, if requests flowing through a set of services start returning failure codes, what Loop will do is actually capture the entire call chain, so it knows how the request flowed through all of the different microservices.
It can capture all of the headers and bodies of the messages, so we can save any failures. If things work, it just throws away the data, but on a failure it can save it, which allows you to then debug it. Again, relevant to running chaos experiments: something bad happens, you can capture what went bad and then replay it later. Hopefully you're not going to do the debugging in production, because setting a breakpoint in your production system is a bad idea.
All of this is documented, so if you do want to try it yourself, you can go download it and play with it. Have fun; we love feedback. There's a public Slack channel that all of our engineers listen to, so if you find everything works, great; if you find things that don't work, let us know and we'll try to fix them. We love feedback on that front. The other thing I've got going over here is the classic bookinfo app.
It's running, and I've also got a Prometheus console up here so we can see some of that, plus my command line hiding down there. A couple of things just to give a little bit of context, going back to that small diagram: the bookinfo app has a product page, which calls out to a set of reviews services, and those call a ratings service on the back end.
The reviews service is going to say whether people liked the book or not, and the ratings service is actually going to say how many stars. We introduced a fourth reviews service (I should update the picture) that we're going to test against, to see whether it fails gracefully or not. We're going to introduce faults on the ratings service and see how the reviews services behave in the face of a back-end fault. Okay, cool.
So let me set this up for a little bit of context, and also test that my system's running. In Prometheus, this little graph is our exit, our failure case. We're basically saying: if we're seeing too many 500s out of the reviews service, the thing in the middle, stop the experiment. Within a minute or so, if we get too many, it's: stop, things have gone really bad, stop.
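That stop trigger is essentially a Prometheus query plus a threshold. As a rough sketch (the field names are approximations; the exact Experiment schema is in the GlooShot docs), it might look like this, assuming Istio's standard istio_requests_total metric is being scraped:

```yaml
# Approximate shape of a GlooShot stop trigger (field names are a sketch;
# check the GlooShot docs for the real Experiment schema). The PromQL
# assumes Istio's standard istio_requests_total metric.
failureConditions:
  - trigger:
      prometheus:
        customQuery: >
          sum(rate(istio_requests_total{destination_workload="reviews",
                                        response_code="500"}[1m]))
        comparisonOperator: ">"
        thresholdValue: 0.1   # stop the experiment once 500s exceed this rate
```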
Again, we don't want things to go that badly, but we are going to intentionally fail the back end; this is our failure case here. And everything we do in the Solo projects that are out there deals with things using custom resources, so we'll build those out, and they help control the deployed services inside of Kubernetes. So I'm going to try to be brave and actually use our live docs: I'm going to copy the command and paste it into the command line.
It's going to work. So now I've got an experiment running, and I probably should have explained it first, so I'll explain it while the test is running. What this experiment does is: I define the failure case, and then I say, hey, on the bookinfo ratings service, introduce 500s 100% of the time. So I'm basically completely failing the ratings service.
we're
seeing
up
in
the
book
info
and
I
reviewed
it.
It
actually
failed,
both
the
review
and
the
ratings.
A
So
not
the
stars
went
away
and
actually
the
text
for
the
review
went
away
as
well.
So
it's
a
complete
like
catastrophic
failure.
Review
service
just
dropped
very
bad
I.
Don't
want
that,
but
I
clicked
and
it
came
back
pretty
quick.
If
I
had
gone
quick
within
a
few
seconds,
it
actually
brought
it
back.
If I update my graph, we suddenly see this: the graph of 500s coming out of the reviews service.
It is a live demo, so all kinds of crazy things could happen. Down here in the text, hopefully big enough that you can see it (and again, you can try it yourself): it stores the results in the experiment object itself, and we see in this case (it's probably a little too small to read) that this thing is actually reporting a failure. It's saying that this experiment stopped because the experiment's failure scenario was triggered. So it's telling us the reason why.
It's actually going to keep all the results, the time snapshots of what happened. But in this case it said, hey, we triggered the experiment's stop scenario, and it tells us that. Okay, great. There are also report objects; in this case, because it stopped quickly, we're not going to see a lot, but I'll show you real quick. Oops.
Let me show you the contents; that might be more interesting. There's a report object: there's the experiment itself, which has some status, and then there's a report. In this case it stopped pretty quickly, but it would otherwise keep time snapshots of all of the data, so you can pull them out of the custom resource and work with them. Obviously, all the data itself is also in Prometheus, so you can use that to do some analysis around: okay, I saw failures; what happened? Okay, so that was a case of... yes? Sorry? [Audience question]
So it'll keep, I mean, Prometheus is running in your app, and for the experiment it'll keep the stats about what you told it to watch, the trigger cases, so it keeps that information as well. If you want things like the full call graph, that's why we're starting to build projects like Loop and others: to actually capture the request headers and bodies and the full call graph. [Audience question]
I should have introduced that up front, that you get swag. So now I'm going to try and see if I can get some more interaction; I've only got one more experiment I'm going to run. Now people are going, hey! I'll be around later, so I'll give out more stuff later.
So let me go rerun the experiment. The story is: we ran the reviews service, we tried our experiment, and we realized it does not fail gracefully at all when ratings goes bad. So let's roll back now.
In this case, all of the reviews services are already deployed; that's part of the bookinfo app, the classic example for showing off Istio traffic shifting. There are multiple reviews services, and we can send different loads at them.
So we're going to go back and update that. This little snippet over here is actually using SuperGloo, which is that management layer, the implementation of the Service Mesh Interface.
It allows us to define routing rules, so we can say: hey, for any service mesh that's under management, we can change the routing rules and it will translate them. In this case I'm going to introduce a change that says: route to reviews version 3. Then we'll rerun the experiment and see if that one fails more gracefully.
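For reference, the plain-Istio equivalent of that routing change is the canonical bookinfo pattern below; SuperGloo's routing rule expresses the same shift through the mesh-agnostic abstraction.

```yaml
# Plain-Istio equivalent of the demo's routing change: send all reviews
# traffic to the v3 subset. Assumes bookinfo's usual DestinationRule
# defining the v1/v2/v3 subsets is in place.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
    - reviews
  http:
    - route:
        - destination:
            host: reviews
            subset: v3
```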
So let me run another experiment. It's the same failure scenario: we're still introducing 500s on ratings, so we're still going to fail the back-end ratings service 100% of the time. We could make it an intermittent failure, but in this case we're making it a catastrophic one, and we're going to run the experiment for 30 seconds. The reason I say that is that I'm hoping this experiment is going to go a little better.
While we're doing this: now I've got the experiment running, and if I refresh the page, we see up on the screen that the reviews are still coming up (it's The Comedy of Errors, "an extremely entertaining play"). The reviews are still coming, but the stars aren't, and we're actually seeing a graceful failure, because it says "Ratings service is currently unavailable." So this is a graceful, circuit-breaking kind of scenario: the back-end ratings service has completely failed, but the reviews service is still working. It hasn't failed entirely.
That's a good thing: we've gracefully failed our application. This is exactly the kind of thing we want to test in terms of the interactions. In this case I'm injecting 500s and completely failing it, but what if I introduced high latency? What if I introduced a 10% failure rate on ratings? How is it going to behave?
This is exactly why you do chaos engineering in production: to test what those interaction patterns are going to be. So, 30 seconds have gone by, our test has stopped, and we're back to the little stars. Let's go look and see, from this different experiment, what the results are.
In this case the result is that the experiment succeeded, which means it ran the full 30 seconds. And if we go look at Prometheus and re-execute the queries, we see (you can't really see it, but it's a flat line at zero) that we saw no 500s out of the reviews, so we didn't trigger any failure.
In that sense the experiment is considered a success, because we didn't trigger the experiment's failure scenario. But now we've got data about how it behaved, and we also have insight into how the reviews service, in this case, gracefully degraded when the ratings service went away. So it gave us some insight into how it's going to behave in a production environment.

[Audience question]
Custom resources; so those things I was showing that you couldn't really read on the screen? Yeah, we've got a bunch of custom resource definitions. In the case of GlooShot, we've defined an Experiment custom resource and a Report custom resource. That way you can configure things: what I was showing is that when you're configuring the experiments, you're just defining values in a custom resource.

[Audience question]
In theory, we could do that. At this point we were just looking at introducing network failures, since that's a very common scenario. Data corruption, we could look at some of that; some of it starts to fall into functional testing as well, because it's about how you're going to deal with that. But yes, that is a scenario you need to deal with, and it's certainly possible: Envoy allows you to do request and response transformations.