From YouTube: Stop Hitting Yourself! - Michael Russell, Elastic
Hey everybody, thank you for coming today. The title of my talk is Stop Hitting Yourself. I was just talking to a gentleman in the back on my way in, and he came here pretty much just for the title. So I promised a few people I'll hit myself at least once, just to make sure everybody gets what they came for. So let's get started; I've got a pretty packed slide deck today. Okay, two quotes about computers to start off with.
So this is me. My name is Michael Russell. I'm an Australian living in the Netherlands, in Utrecht. I work at Elastic as a software engineer on the infrastructure team. I currently respond to Michael, Mick, Mickey, Mike, Mikey, Russell, Rusty, crazy boss, and I'm looking for new nicknames; a Chinese one would also be great to add to the list.
So if anyone has anything, just find me around the conference or at the Elastic booth. I enjoy food, travelling, gaming and also, I forgot that I left this in, photoshopping myself into ridiculous situations like that.
Okay, so the title of the talk is Stop Hitting Yourself. For anyone not familiar with the reference, this is from a TV show called The Simpsons. In The Simpsons, the guy on the left here, his name is Nelson, is a school bully, and the guy on the right is a bit of a nerd, the kind of guy that writes software. What Nelson is doing is filming him; he's grabbed his arm and he's saying: stop hitting yourself, stop hitting yourself, stop hitting yourself. The connection is that this is how I feel the experience is a lot of the time with modern software, where you go to deploy something and you immediately get an error: oh, you need to set this configuration setting.
You fix that, you deploy something else, and then you run into the next problem. This is a pattern that I've seen throughout the years of my career, and it just keeps coming back even with newer technology. So after writing this speech and reading it back, I realized it was pretty negative, because I'm pretty much complaining about software the whole time and showing the negative side of things. But I just wanted to say that I really enjoy what I do. I really like Kubernetes. I didn't get paid to say that, and they didn't actually pay me at all, so you can believe what I say.
Okay, so it's going to be a sort of storytelling format. I've got a few different stories, working from the older age of technology right up until the modern day with Kubernetes. We're going to start where everyone was around five years ago: writing bash scripts.
So one day a developer came up to me and said: I need you to do a restore of our production database. I'd been working at the company for a few months; I didn't know that we had a backup. Apparently we did, which was good to know, and when I asked him about it, all he could tell me was: there's a bash script and a cron job somewhere.
So this was off to a good start; we at least had some form of backup method. The question then was: will it actually work? To go back even further in time, this is what the script looked like 20 years ago, the original version.
At this point it's not too bad. It's just a single mysqldump command, putting the output into a backup directory which is hopefully somewhere that's not going to get lost. This was okay. But this is how the changelog looked when I opened it up; I removed a few of the changelog entries, but over the years lots of advancements had been made, little optimizations and things. This script had history, I thought.
If this has been running for 20 years now, it must be a pretty good script, right? This has been doing production backups for 20 years; I can trust this. So this is what the script looked like, from memory; I just had to make it up a bit. At first glance this looks pretty good, particularly the top line up there: the important DO NOT REMOVE warning, and set -e. For anyone who hasn't done much bash scripting, set -e is really, really important.
So this is the first example of stop hitting yourself, and luckily someone else had already run into it. What set -e does is say that if any of the lines, any of the commands, fail, I want you to actually exit and report the failure. Otherwise, the only thing needed for a successful backup was this line. So before a guy added this five years ago, the script was always successful, because the echo "backup successful" always worked.
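To make that concrete, here's a minimal sketch of the kind of script being described; this is reconstructed from the talk, not the actual slide, and the paths and database name are made up:

```bash
#!/bin/bash
# IMPORTANT: DO NOT REMOVE
set -e  # exit as soon as any command fails

# Without set -e, the script's exit status is that of the *last* command,
# so the echo below made every run look "successful" even when the dump failed.
mysqldump jobs_db > /backup/jobs_db_$(date +%F).sql
echo "backup successful"
```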
They were doing the dump per table, so that was already a bit of a concern, and I would then need to do the restore in reverse order, which isn't too bad, definitely possible. The other thing is that a cron job which sends an email when it fails is pretty much the lowest level of monitoring you can imagine. If you don't get an email every day, you're just basically assuming: hey, my backup must have worked. You don't know if it actually ran.
If your email server is broken, you don't really know anything. There are actually only a couple of things that I think email is useful for. One of those is a human sending a message to another human, when they already know that human. The other is saving articles for yourself, or cat photos that you want to send to your girlfriend later. That's pretty much it.
Those are the two things I find email useful for. Something it's not so good for is any kind of monitoring alerts, computers and automated systems needing to contact humans, and finally, alerting the fire brigade. That's from the IT Crowd; it's a real scene, so if you haven't watched it, I highly recommend the show.
So the surprising thing was that this was a backup script with a lot of history, around for 20 years, and when I restored using this old script it actually worked; I was pretty surprised. The site was working, it was up; when I searched for a job (it was a job website) there seemed to be some stuff in the database. So I was pretty happy with that, and quite surprised.
So we now had a backup that was restorable, which was a pretty good upgrade from earlier in the morning, when I didn't even know we had backups. That's when you actually start to look into it and find out: oh, it may have restored, but is all the data there?
The interesting part here was that the search I was looking at was served by a search engine, not the MySQL database itself. The search engine was okay, but the actual database was more or less completely empty. When we had a look, there were only eight jobs online when there should have been around 25,000. So at this stage I thought: maybe it was just a bad backup.
Maybe something happened this morning; maybe yesterday's backup was actually fine, or the one from the week before. Let's see what we have. What I did was have a look at the size of the backups for the previous week (this was a daily backup, I think), and I saw this, which was pretty horrifying, because you've got to imagine:
this is a script that's been running for 20 years, and what you would expect to see, given that the jobs table was append-only and every job from the whole history of the company was in there, is an approximately five-megabyte backup, getting a tiny bit bigger every day. So as soon as you see no two backups that are even remotely similar to each other, you can pretty much conclude something is wrong. And this was just one table as well.
So that would make quite a good feature request. The interesting part is this next changelog entry: someone else came along and, once the backups were actually failing, fixed the reason they were failing, and this meant that all of a sudden there was a lot more data in the backups. So suddenly the disk was filling up.
The fix was to pipe the mysqldump command into gzip, and what's interesting here is that, as we said before, the default for a bash script is that as long as the last line works, that's okay, and everything else can fail; that's fixed by the set -e. But the new feature which is needed here is set -o pipefail. Without it, the exit status of a pipeline is just the exit status of its last command, so if mysqldump, the only important command in the whole script, fails but gzip succeeds, the line still counts as a success.
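A sketch of the fixed version, again reconstructed from the description rather than the real slide:

```bash
#!/bin/bash
set -e           # exit on any failing command
set -o pipefail  # a pipeline fails if *any* command in it fails

# Without pipefail, only gzip's exit status would count here, so a
# failing mysqldump still produced a "successful" (empty) backup.
mysqldump jobs_db | gzip > /backup/jobs_db_$(date +%F).sql.gz
echo "backup successful"
```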
Without that, we were back in the situation where every single backup was successful again. In total, looking at the dates, there were a good 12 days of valid backups, but we didn't have them anymore; they would have been nice. So, to get back to what I said earlier: software is easy to use but hard to run. Data is easy to back up but hard to restore, hard to restore and verify.
A
Okay.
Let's,
let's
talk
about
something
a
bit
more
modern.
Hopefully,
we've
improved
from
our
bash
scripting
days.
So,
let's
start
with
docker,
so
everyone
has
used.
Docker
is
probably
done
this
at
some
stage
or
probably
multiple
times
at
us
at
docker
is
working.
This
is
the
helloworld
tutorial
for
dhoka
dhoka,
one
hello
world,
and
not
only
do
you
get
a
hello
world
message,
but
you
also
get
a
nice
bit
of
output
about
what
actually
happened
to
make
all
of
that
magic
work.
Number two in particular is very interesting here: "the Docker daemon pulled the hello-world image from the Docker Hub". That happens to be a haiku, and it's been verified by our own haiku bot; you can ask me about that later. But what's interesting is that none of that sentence is actually true. Pretty much every part of it can be picked apart and shown to be something that's going to set people up to hurt themselves later on.
So the first part is that the purple writing should really be: the Docker daemon checked if the image hello-world with the tag latest was already downloaded and available locally; if it already existed, it did nothing. The next part is that it doesn't actually check if it's the right hello-world image from the Docker Hub. If you happen to have another image called hello-world with the tag latest, it would happily use that as well.
Someone came to us and said: you haven't patched production in around six months, which happened to be the amount of time we'd been running the application for, so it pretty much translated to: you have a never-patched production. When we looked into it, we found out exactly what I was explaining before: Docker is only going to connect and pull the latest tag if that tag doesn't exist locally. The build server already had the 6.6 tag, so it will never, ever update it again. So we looked this up.
There was an open issue in Docker, or Moby as it's now called, of course, and they said: you need to pull the image first. Okay, that's a pretty easy fix. We updated our deployment build scripts to include a docker pull of centos:6.6 first, and then we could go deploy our application, and that should work just fine, right? Nope, not yet.
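In build-script form the fix amounts to something like this (a sketch; the app image name here is illustrative):

```bash
# Force a registry check so a stale local centos:6.6 on the build
# server can't be silently reused forever.
docker pull centos:6.6
docker build -t our-app:latest .
```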
So the next problem that we ran into is that what a version tag means is completely up to the owner of the image and how they want to implement it.
This is the current documentation for the CentOS image, and the short version is basically: if you use that 6.6 version, they're going to match it to the ISO release of CentOS, from back in the days when you used to download a CD and burn it. That's an unpatched version of CentOS, which they've done on purpose, and their recommendation is to run yum update and yum clean all yourself. So this was good news: we found out, okay, here's the reason why, and it's clearly documented.
A
We
just
need
to
add
this
in
so
we
added
with
him
and
everybody
celebrated.
We
actually
did
a
deploy
and
confirmed
yep
it's
updated
now.
This
is
perfect.
So
that's
why
we're
very
surprised
when
one
month
later
I'll
hit
myself
again.
Why
are
you
hitting
yourself
why
having
a
batch
production?
So now we learn about the Docker build cache. For anyone who doesn't know: for any of those commands that you add inside the Dockerfile, like this one, the way Docker determines whether or not to run them again is based on the command text itself. So the first time you run this, it is actually going to do a yum update and update everything. If you then run it again directly after, Docker is going to go: you didn't change the update command, so there's no need for me to run that again, right? And to make it even worse, we had actually checked it twice, but those two times happened to hit different build servers, so each time we saw it updating, and we really thought we'd solved it. So, once again, the solution here is yet another override for yet another default.
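One common way to bust the cache for that step (a sketch of the general technique, not necessarily the exact fix from the talk) is a build argument whose value changes on every build:

```dockerfile
FROM centos:6.6

# Any RUN after this ARG implicitly includes its value in the cache key,
# so passing a fresh value forces the update step to re-run:
#   docker build --build-arg CACHE_DATE=$(date +%s) .
# (docker build --no-cache skips the cache entirely.)
ARG CACHE_DATE=unknown
RUN yum -y update && yum clean all
```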
So, let's get on to Kubernetes; that's what everyone came here for, not to talk about bash scripting. Is there a BashCon? That would be pretty nice; I could talk there. I was hoping that Kubernetes had taken a big step and actually tried to fix some of this tagging behavior that Docker has, because I'm going to guess at least some people here have felt the same pain that I've run through a few times. But unfortunately, Kubernetes tries to stay compatible with the Docker way of doing things.
So the imagePullPolicy is also set to Docker's default. But luckily you do have a nice option, imagePullPolicy: Always, so you can pull in the latest tag every time, depending on your tagging strategy.
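In a pod spec that looks something like this (the image name is just an example):

```yaml
spec:
  containers:
    - name: app
      image: our-app:latest
      # Check the registry on every pod start instead of trusting
      # whatever tag happens to be cached on the node:
      imagePullPolicy: Always
```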
So let's now look at a hello world example for Kubernetes. This is a very basic one that a lot of people will have done: just starting an nginx container and then exposing the deployment with a load balancer.
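The two commands being described are roughly these (in the Kubernetes versions current at the time of the talk, kubectl run created a Deployment):

```bash
kubectl run nginx --image=nginx
kubectl expose deployment nginx --type=LoadBalancer --port=80
```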
What's really nice about Kubernetes is that when you run a command like that, you can actually take a look and see what Kubernetes has generated for you. When you look through the YAML, you go: yeah, this looks good. It's figured out rolling updates, there's a setting for replicas that I can tweak so I can make it highly available, there are defaults for everything when you use the kubectl run command. And resources? No, I don't have any resources, so I guess that's empty.
So that's the default experience people get when deploying an application. They might take this and tweak it slightly to their needs, but that's also where this goes wrong again.
Here's a few questions that Kubernetes did not ask you and did not fill out in the template. As I go through them, think about how you would answer these yourself; don't just listen to my answers. When deploying your application, is it okay if all of the containers are running on the same host? No, you'd rather spread them out, right? That's why you're using Kubernetes. Is it okay if all the containers go down during a rolling upgrade of Kubernetes itself? No, I don't want that either; I'll turn that off, please.
Should these containers always run, or just sometimes? No? Okay. Should I check if the container is healthy? Hopefully that's an easy one. You asked for three replicas: do you really want that many, or do you just like that number a lot? How about two? Is two okay?
So let's get into those with a few practical examples. Firstly, the update strategy. Like I said before, at first glance this looks pretty good: you see it's a rolling update, maxSurge is one, maxUnavailable is one. So you think to yourself:
what will that actually do when I do an update? This is what I would expect to happen: I deploy a new version, that new version actually gets deployed, and the moment it's up, the old one is removed. That's the standard way of doing rolling updates; you don't want to take one down first and have any downtime. But here's what actually happens, once again.
I'll hit myself, just for the guy up the back there; you're welcome. What actually happens is you deploy the new version, and at the same time that the new version is being deployed, Kubernetes is going to say: I'm allowed a maxSurge of one, so it can run an extra container at once, and a maxUnavailable of one.
The fix here is that once you set maxUnavailable to zero, it's always going to keep three running during a rolling upgrade. With a maxUnavailable of one, what's going to happen is it starts a new one and at the same time removes an old one, so during an upgrade you go from three replicas down to two. You're just willingly giving up a third of your capacity during an update, and you're still open to node failures and everything else on top. So that's where the question comes from: do you really want three, or do you just like the number?
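As a sketch, the strategy block being recommended looks something like this:

```yaml
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # may create one extra pod during the update
      maxUnavailable: 0  # never drop below the desired replica count
```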
The next one is a Kubernetes upgrade itself. Quite often this is not being done by the users of the application, which makes it even more surprising. Imagine we've got our nginx deployment running, we've got three replicas, we've got everything configured properly and tested nicely, and all of a sudden our application goes down. We log into the Kubernetes cluster, have a look, and see three containers not running, and you think: okay, that's not ideal. And the default, if you remember the question from before, is: is it okay if all the containers go down during an upgrade?
That's exactly what happens out of the box. If you do a rolling restart of each node and the container fails to start again somewhere else, the upgrade is actually just going to keep going. But luckily there is a solution; you do need to configure it.
That's where pod disruption budgets come in. With a pod disruption budget, you're able to tell Kubernetes, from a user's point of view: this application can only have one container unavailable at any one time.
So what's going to happen in this case is that Kubernetes tries to upgrade a node: it drains it and moves the container somewhere else. If the container fails to start up, Kubernetes is going to stop the whole cluster upgrade just for you, because you're special, and it's going to annoy everyone else until you fix it. But you know, you've at least prevented an outage caused by something you may not control. Next example.
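A minimal pod disruption budget for the nginx example might look like this (using the policy/v1beta1 API that was current around the time of this talk; newer clusters use policy/v1):

```yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb
spec:
  maxUnavailable: 1   # at most one pod may be voluntarily evicted at a time
  selector:
    matchLabels:
      app: nginx
```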
If you're running a three-node Kubernetes cluster, it's very possible that all of those containers will end up on a single Kubernetes node, a single virtual machine. If that machine does actually crash and is unable to connect to Kubernetes, the default waiting time is 5 minutes. After 5 minutes, Kubernetes goes: I think the node's not coming back now, let's start those containers somewhere else. So that's 5 minutes of downtime, just because you haven't set up your pod anti-affinity. These are very hard to say.
The way pod anti-affinity works is that it's a way to define where your containers can and can't run; there's both affinity and anti-affinity. In this example, the only really important line is this one, the topology key. With this, you're able to say: for this nginx deployment, I want to make sure all of the pods are running on different hosts. This is going to guarantee that they don't end up on the same node.
It's also possible to divide this up by zone, or by any other labels that you might have on your nodes. But just by adding this in, you can now simply say: make sure the containers are separated and running on different hosts, so that I can actually survive a failure. It feels like something that should be a default, but it's just not.
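A sketch of the anti-affinity block being described, for the nginx deployment:

```yaml
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: nginx
          # The important line: never put two of these pods on the same
          # node. Use a zone label here to spread across zones instead.
          topologyKey: kubernetes.io/hostname
```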
So at this stage things are pretty battle-hardened, and you'd think there is no way something could take this down that we could fix by adding in some configuration; and you'd be pretty much right, we're nearly there. Because you're doing so well, some other people in your company say: hey, that Kubernetes thing actually seems to work, and you guys are actually getting work done now. Can we have some of that Kubernetes? And of course you want them to use it as well, so you help them out.
You show them how it works. But they've come into it later than you, they've seen the updated documentation, and they've also seen this resources thing. So they thought: let's configure resources, because someone actually read the docs and understands what it does. At this stage, they've deployed a few applications.
They've asked for some CPU and memory requests, and now your application has gone down and you can't figure out why, until you actually have a look at what the resources do. Out of the box in Kubernetes, if you use no resource definitions, your pod is considered to be best effort. In a world where everything on the cluster is best effort, that actually turns out pretty okay, because if you have two containers running on a host and they both want to use all the CPU, they'll get half each.
If you add a third one in, they're now going to get a third each, and this works great. But then what happens is someone comes along, does things properly, and actually says: hey, I want some CPU. Kubernetes is going to go: oh, that's okay, no one else wanted any, I'm just going to give all of that to you instead. It's going to starve the other containers of resources, and it will actually just completely delete them, because they're best effort. You didn't tell me they were important.
You said you wanted me to run them, but you didn't really say they were all that important. So this is what the resources section looks like; hopefully everyone has seen it before. The quick explanation of how it works, because a lot of people get confused between limits and requests: a limit is what it sounds like, a hard limit. If you set the CPU limit to 1000 milliCPU, it's just going to throttle your process to that speed.
With the memory limit at 500 megabytes, when you hit that, it's going to do an out-of-memory kill, a bit like Linux does, and actually restart the container. The request, on the other hand, is actually going to reserve these amounts. So if you only set limits and don't add any requests, you're going to run into the same problem as before; it's kind of like best effort with a limit, which is almost worse than not having anything at all.
But a request is going to guarantee that at any time your container is going to be on a node that has at least one CPU and 500 megabytes of memory for you. And then, of course, you're going to want to set these as well.
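The resources block being described, a sketch using the numbers from the talk:

```yaml
resources:
  requests:          # reserved; what the scheduler guarantees you
    cpu: "1"
    memory: 500Mi
  limits:            # hard caps; CPU is throttled, memory is OOM-killed
    cpu: "1"
    memory: 500Mi
```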
So if everyone starts setting how much memory they want, everyone's just going to say: I want 20 CPUs and all the memory. What you can also set up is resource quotas, which work at a namespace level.
You can say: this team, or this namespace, has access to 50 gigabytes of memory and 20 CPUs. And then you're going to find out it's very painful to keep telling every single developer, on every single pull request: don't forget to add in resources. So you also have default resource limits, where you can define a default setting for anyone who doesn't set it.
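A sketch of both, assuming a per-team namespace:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
spec:
  hard:
    requests.cpu: "20"     # the whole namespace shares 20 CPUs...
    requests.memory: 50Gi  # ...and 50 gigabytes of memory
---
apiVersion: v1
kind: LimitRange
metadata:
  name: default-resources
spec:
  limits:
    - type: Container
      defaultRequest:      # applied when a container sets no requests
        cpu: 100m
        memory: 128Mi
      default:             # applied when a container sets no limits
        cpu: 200m
        memory: 256Mi
```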
So if you deploy an application and don't set any resources, you're going to get a certain amount by default, which is probably not going to be enough, but enough for them to realize that they need to actually set it. It's a good way to hit them back, I guess, is a better way of putting it.
So I think this is the final one, and all of these are based on the true story of constantly trying to get an application working, running into problem after problem.
In this case, your website is functioning, but someone has a look at the logs and you're losing about one in ten requests, and you have no idea which one it is or what's going on. After taking a good look, you find out that one container has got itself into a bad state and probably shouldn't be receiving any traffic. The nice answer to this is readiness probes and liveness probes.
Both of these support doing HTTP calls, but also just running a shell command, and once again it's another thing that people get very confused about. I would urge everyone to at least go for a readiness probe. The readiness probe is used for a whole bunch of different use cases in Kubernetes.
When you're doing an upgrade, Kubernetes waits for the readiness probe to pass; when you have a service with a load balancer, the readiness probe defines whether or not the pod is actually included in the service; and when you're doing a Kubernetes upgrade, the pod disruption budget uses this readiness probe to decide: yes, this container is actually healthy again. The liveness probe functions the same way, but this one is only used to restart the container if it's failing.
So this should normally be something a bit simpler: if the HTTP server is not responding at all, let's try restarting it and see if that helps.
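A sketch of the two probes on the nginx example (the health path is a made-up endpoint):

```yaml
containers:
  - name: nginx
    image: nginx
    readinessProbe:        # gates traffic, upgrades, and disruption budgets
      httpGet:
        path: /healthz
        port: 80
      periodSeconds: 5
    livenessProbe:         # only restarts the container when it keeps failing
      httpGet:
        path: /healthz
        port: 80
      initialDelaySeconds: 30
```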
So here's the thing you should take a photo of, to summarize all this if you don't want to remember it. These are just all of the things that I've gone through: if you're not setting these for production already, you really should be. So, just once again: resource requests.
And as we saw before with the email example and the cron job, more important than just having the monitoring in the first place is actually being aware: being able to trace back and see what happened.
finally,
once
again,
another
true
story:
a
way
to
restore
your
applications.
If
someone
accidentally
deletes
the
kubernetes
cluster.
Luckily
we
keep
all
of
our
configuration
under
version
control
and
our
secrets
have
persisted
outside
of
the
cluster
in
hushka
vault,
so
it
is
possible,
but
it's
something
you
need
to
think
about
in
case
you
ever
have
to
do
it.
It can be quite difficult when you have services dependent on each other as well: how to restore stuff in the right order.
So after writing this talk (how long have I got? okay), I had to actually start questioning: why is software like this? Why do we do this to ourselves? Why do we make ourselves hit ourselves? Part of the reason here is that it's really about those first impressions.
None of these features existed yet at first. If you suddenly upgraded to a new Kubernetes version and it started spreading your pods around when they used to all run on one host, that could be pretty unexpected, and that would be breaking backwards compatibility.
So here are a few ideas I have to try and improve things. One is actually talking about it, making sure that people are aware; and this doesn't just apply to Kubernetes, this is about all software. Reading the documentation. Actually testing it: how does failover happen? What happens if I do something silly like breaking the configuration?
How does it respond? Maybe also creating some documentation for these: here's an example of a battle-tested application using all of these features. Maybe it's even possible to start thinking about changing some of the defaults, and maybe some of these should become the default settings, a bit like the rolling upgrade strategy.
Another potential idea is to go even further and have a code-level implementation of this: a way to actually say this namespace, this cluster, is in production mode, and it's not possible to deploy to it without these things being configured.
And finally, it could also be something potentially added to a Helm chart template. So, another quick final thing from me: deploying to production is easy; running tests locally is hard.
This is a little plug for a tool that I've been working on lately, which focuses on the opposite end of this problem.
Right now, I'd say that it's actually too easy to deploy something into production. Currently I'm working on Helm charts for Elasticsearch, and the biggest concern that we have is that people can now deploy quite a complex application with a single command. That's going to lead to a lot of people making mistakes, people who don't know Elasticsearch, who don't know Kubernetes, who don't know how to run a cluster, and it's sort of setting them up for failure a bit, which is worrying.
So this tool is about approaching things from the opposite angle. While I find it so easy to install a persistent, highly available cluster on Kubernetes without understanding it, I find that the opposite end of the spectrum, actually running tests in a development environment, is way too hard and difficult.
A lot of my effort and time at my job goes into actually making sure that something I write will work on Windows, Mac and Linux, in the CI environment, on Jenkins, on Travis CI and everywhere else.
So it's also going to forward your SSH agent into the container, even on Windows and macOS, so that you can run Ansible from Windows, Mac or Linux, or in your CI environment, using the exact same versions and the exact same system libraries. It's also going to forward environment variables and attach the Docker socket. It also supports proxying any commands to the host for any specific tooling, like Keybase, where we have a use case where we need to decrypt something from Keybase to unseal Vault. So it's got a...
B: I have a couple of questions. The first one is about resource requests and resource limits: what is your recommendation on this point? Do we always need to set the resource requests the same as the resource limits, or should we set them as small as possible? And the second question is about the last point, a way to restore your application if the Kubernetes cluster gets deleted: is version control enough, or is that a problem?
A: So the first question was, with resource requests and limits, what do I recommend. For most of our applications we tend to set the request and the limit to the same amount, because there's nothing more frustrating than finding an application that's performing worse, and that's because it's no longer able to burst anymore. Bursting is a cool feature to have and it sounds great, but once you get used to it on a quiet cluster, you don't want someone else deploying a new application to take away performance from the first one.
So I would really recommend, depending on your use case, if it's something that's user-facing and needs to serve requests in real time, to really consider setting those the same. If you're running batch processing, where you really do want to use every last CPU and you're not worried about being slower or faster during some periods, then not setting limits could make sense. And the second question was how we actually manage our configuration at Elastic.
So right now we use Helm charts, and we then use the Terraform Helm provider to manage the Helm charts. The main reason for that is that we can guarantee the state of it, so you don't need to make sure that everyone is running helm upgrade and that they have the latest version deployed; there's nothing worse than going to deploy something and then finding out the current version is a couple of versions older than what's currently in git. The other advantage you get from using Terraform for that is that you can also define dependencies.
So, like I said before, when we had to restore our cluster from nothing, we weren't using Terraform at the time, so we had to figure out the correct order to install things: okay, Elasticsearch needs to be installed first, because Metricbeat talks to Elasticsearch, and then Vault needs to be installed...
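A minimal sketch of that dependency idea with the Terraform Helm provider (the resource and chart names here are illustrative, not Elastic's actual configuration):

```hcl
resource "helm_release" "elasticsearch" {
  name  = "elasticsearch"
  chart = "elastic/elasticsearch"
}

resource "helm_release" "metricbeat" {
  name  = "metricbeat"
  chart = "elastic/metricbeat"

  # Terraform's dependency graph records the install order that otherwise
  # has to be reconstructed by hand during a restore.
  depends_on = [helm_release.elasticsearch]
}
```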
C: A question about that: how can we make the test environment and the production environment consistent? I think it's a big headache, because whenever we have a production environment and a staging environment, something in the staging environment is not exactly the same as in the production environment. How do we build consistency between these two environments?
A: I don't work for HashiCorp, but I've been plugging their products; we also use Terraform for this. Having a staging Kubernetes cluster and a production one allows us to test Kubernetes changes on the staging environment first, and if you're not familiar with Terraform, you define the state that you want, so you have much better guarantees that staging is actually the same environment as you have in production. It's always a difficult problem, but at least with Kubernetes it's become a lot easier to get something that's very, very close.