From YouTube: Democratic Deploys at Airbnb - GitHub Universe 2015
Description
As teams grow, there is often a temptation to add more process around shipping code in an effort to make it safer. Topher Lin and Igor Serebryany describe an alternate approach — using flexible tools to enable engineers who write the code to also ship their code quickly and safely. Airbnb's tooling deeply integrates SCM, builds, and deploys to guide engineers through the deploy process.
About GitHub Universe:
Great software is more than code. GitHub Universe serves as a showcase for how people work together to solve the hard problems of developing software.
For more information on GitHub Universe, check the website:
http://githubuniverse.com
A: So it's good to see you all here, thanks for coming to our talk. My name is Igor, and my name's Topher. We are, oh, is this clicker working? We are engineers on Airbnb's developer happiness team, which has actually been rebranded to developer infrastructure, but at heart we still consider ourselves developer happiness engineers.
A: So we're going to be talking about kind of the antidote to the process that you see pictured on the screen here. This is the traditional release management process, which involves taking code through many discrete cycles, where different people take it through different stages and do different things with it, and those things might be full of mysteries. We found that this process doesn't actually make engineers particularly happy.
A: The thing that we do to overcome that kind of process is this: this is Deploy Board. We're going to be talking a lot more about it.
A: This is how many deploys we do through Deploy Board, starting in 2012, and you can see it's an exponentially trending-up line. This is the number of engineers using Deploy Board; it's a very similar graph, basically the same. The reason both of these grow in sync is that at Airbnb every engineer deploys her own code. There's no handing it off over the wall to some release managers or anything like that, and we'll talk about how that works.
A: So the agenda for this talk is: why do we think engineers should ship their own code in the first place, and what kind of tooling have we built to enable them to do that?
A: So the fundamental problem, and I think the reason why many companies resort to an extremely process-heavy deploy process, is that deploys are pretty dangerous. Deploys are oftentimes the thing that breaks your infrastructure; if something goes wrong, you should look for the deploy that caused it. And so, in order to overcome that problem, people often use these guys.
A: These magical wizards are called release engineers, and they're supposed to ensure that whatever changes are coming down the pipeline into your application don't break the application. We at Airbnb don't actually think it works this way; we have several objections to this common practice at many other software companies. Some of these objections are practical.
A: When release managers are deploying someone else's code, they're not as familiar with it as the people who originally wrote it, so they don't have as much context on what they should watch for: which metrics they should be observing, which parts of the site might break, what might actually be affected by the deploy. They also become a bottleneck for getting code out the door, which means the overall process of deploying slows down. Oftentimes people have a traditional release process.
A: Then they release at some fixed cadence, maybe once a week or something, and as a result the deploys they're doing become much larger. More code goes out in each deploy, and that is actually more dangerous: the more code you have going out, the greater the likelihood that something in that giant pile of code is broken.
A
Oh
and
then,
if
something
is
broken,
it
becomes
harder
to
tell
what
exactly
is
broken
because
there's
so
many
different
changes
going
out
at
the
same
time,
and
you
still
like,
maybe
you
could
be
like
okay,
maybe
we
could
just
prevent
code
from
breaking
in
the
first
place
by
using
this
release
management
process,
but
in
reality
no
one
can
do
that
that
you
really
don't
know
until
you
deploy
to
production
and
like
observe
metrics,
whether
the
code
actually
works,
like
it's
really
difficult
to
do
that
and
British
managers
don't
have
any
more
power
to
do
that
than
any
other
engineer,
and
another
part
of
our
objection
to
this
kind
of
process
is
also
like
a
cultural
objection.
A: We think that using this process, where you write code and then hand it off to someone else to take care of from there, causes bad cultural effects. It means you have less autonomy: you don't have as much control over when your projects actually make it out the door. You lose some responsibility, and you lose some knowledge of what it's like to actually operate your code in production. We take these things seriously at Airbnb; we're a very core-values-driven company. Another objection is that the job of engineers is not actually to write code. The job of engineers is to make product that they deliver to users, and when you reduce their job to "you write some code, and then someone else will deliver the product", that's not really what engineers are supposed to be doing.
A: So this is an old picture of the Airbnb engineering team; we're in lederhosen. All of these people are release managers: we make it so that any person on the engineering team is actually a release manager. And people might say, well, you know, maybe that's not a good idea. Are these people even qualified to be release managers? Isn't there a ton of stuff they have to know to be able to do that? They have to know how to deploy.
A: They have to know how to roll back when something goes wrong. They have to know how to even tell that something went wrong in the first place. And we think that even if you're a formally trained release engineer, you still have trouble doing all these things. The way that you manage to do this is you have really great tools; like this big wrench here, that's what release managers use to do their jobs. And so actually the problem becomes really a tooling problem.
B: Okay, great. So far, we've figured out that traditional release management is not magic; it's not going to fix all your problems. There are some practical issues with having a bottleneck that slows things down, and culturally we want everyone to be able to ship their own code, and the way we achieve that is with tooling. So in this next section we're going to go through some of the tooling that Airbnb has built and show what the workflow looks like. So, great: Airbnb releases.
B: How do they work? It actually starts when you push your code upstream for the first time and open a pull request. We do everything we can to help you at that point, even before you merge the code into master and put it on a machine. We want to make sure that you're in touch with the right people and are getting a lot of feedback on whether your change is going to be safe to release or not.
B: People would ask for review from whoever their friends were, whoever referred them to the company, or the new-hire buddy that they had lunch with during their first week, even if those people were not the ones best equipped to review that piece of code. As a result, there were things that passed review where the reviewers weren't experts on certain domains, like our messaging code; they didn't understand the edge cases, and so they couldn't give proper feedback.
B: Another use case we ran into was areas of code that we want everyone to contribute to, but which have certain guidelines that people need to be aware of. An example of this is our API code: we've written an API framework in house, and we want everyone to be able to write their own API endpoints and to modify them as they need to for their product needs.
B: But we also want the people who wrote the framework and maintain it to review all the code that affects the API, or is possibly related to it: to make sure everything is following the guidelines, to make sure people are following best practices, and to be aware of needs that the product teams have that shape the API framework. So to help with this, we created a review routing bot. Here you can see at the top we have a pull request: John is modifying some messaging-related code. This is traditionally a pretty complex part of Airbnb.
B: It's not just messaging; it's also sending special offers and opening reservations. In this case, there was an edge case where we weren't showing the dates of a reservation after it had been altered at the request of a guest or host, so John's updating that to make it happen. Because this code is pretty complex and pretty old, it also has some technical debt that only a few people really understand well, so we need a bot to tell John who actually knows about this stuff. These are the poor unfortunate souls who are responsible: Spike, Ben, and Gordon.
B: The way this works is that the bot looks for a reviewers comment within the file, in this case the messaging controller, and automatically modifies the PR to @-mention the relevant people. You can do this for directories as well: you can put a REVIEWERS file in the root of a directory, and it applies to all files within that directory and any subdirectories. Anyone can set reviewers at any time. We've kind of organically grown the number of things that have reviewers over time, as we've seen the need to do so.
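The mechanics of that lookup can be sketched roughly like this. This is a hypothetical reconstruction: the comment format, function names, and mention text here are our inventions, not Airbnb's actual bot.

```ruby
# Scan a changed file for a reviewers comment of the form:
#   # REVIEWERS: @spike @ben @gordon
def reviewers_for(file_contents)
  line = file_contents.lines.find { |l| l =~ /REVIEWERS:/i }
  return [] unless line
  line.scan(/@[\w-]+/)
end

# Build the @-mention the bot would post on the pull request.
def mention_comment(changed_files)
  handles = changed_files.flat_map { |contents| reviewers_for(contents) }.uniq
  return nil if handles.empty?
  "cc #{handles.join(' ')} -- this change touches code you maintain"
end

controller = <<~RUBY
  # REVIEWERS: @spike @ben @gordon
  class MessagingController < ApplicationController
  end
RUBY

puts mention_comment([controller])
# prints: cc @spike @ben @gordon -- this change touches code you maintain
```

In a real bot this string would be posted back through the pull request comments API when the webhook for the PR arrives.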
B: The other piece of what we do is something that a lot of people do: continuous integration, and here we use the GitHub statuses API to great effect. This was pretty much game-changing when we started using it, because it puts CI checks right in developers' faces. They're really used to the pull request UI, so putting continuous integration results right there is really important, and it also affects how the merge button is rendered. Here, for instance, a continuous integration job has failed, and we therefore say "merge with caution".
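Rendering the merge button this way amounts to folding a commit's combined statuses (the states the statuses API reports: `pending`, `success`, `failure`, `error`) into a single verdict. A minimal sketch of that decision, with invented labels rather than Deploy Board's actual ones:

```ruby
# Collapse a commit's statuses into one merge-button label.
# `statuses` mimics what the statuses API returns: one entry per
# context (CI job, deploy lock, ...) with its current state.
def merge_button_label(statuses)
  states = statuses.map { |s| s[:state] }
  return 'Merge with caution' if states.any? { |st| %w[failure error].include?(st) }
  return 'Waiting for checks' if states.include?('pending')
  'Good to merge'
end

statuses = [
  { context: 'ci/unit-tests', state: 'failure' },
  { context: 'deploy/lock',   state: 'success' },
]
puts merge_button_label(statuses)  # prints: Merge with caution
```

Any single failure wins over everything else, which is the conservative choice the talk describes: the button still works, but the wording warns you.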
B: Another thing we do, you'll notice there are a couple of deploy lock statuses; those are used to communicate the status of the service at the time. If we have some kind of incident going on, we'll set a lock, and that'll set the status to red, so that developers know that even if their change passes all the tests, it might not be a safe time to merge right now. Later on, when we unlock, things will go green and they can merge. If something goes wrong, they can click through to the details.
B: We've invested a lot of effort into creating a single UI in Deploy Board for any sort of continuous integration system, so that developers don't have to flip between a lot of different CI UIs. For some reason, CI systems like to own everything, including the UI, and sometimes the UI doesn't really fit our needs, so we put some time into making sure that everything can fit in. Is my mic on? Cool, okay, awesome. Thank you.
B: Okay, great. So after everything passes and you're ready to merge, there's that big juicy green merge button; you click it, you feel really good, and after that you go to Deploy Board. This is the builds page in Deploy Board: every PR that's merged, you can see listed there. You can use the compare buttons in the fourth column to see what the change set is between any two given builds; that just links to the GitHub UI's compare view between change sets.
B
You
can
see
what
build
is
currently
deployed
to
each
development,
it's
at
each
target
environment.
So
if
you
want
deploy
just
click
that
deploy
button
and
you
get
this
little
modal,
you
are
watched
over
by
our
vp
of
engineering
mike
who
took
an
absurd
photo
with
some
magazine
and
now
we've
created
stickers
and
put
them
all
over
the
office.
It's
pretty
great
I,
don't
know
if
he
appreciates
it.
We
have
two
different
environments
you
can
deploy
to.
B
One
is
called
next,
so
the
way
it
works
deploys
work
at
Airbnb
is
in
addition
to
all
the
testing
we
have
an
environment
called
next,
which
is
similar
to
staging.
It
doesn't
receive
production
traffic,
but
it
does
otherwise
behave
like
a
production
machine.
So
what
we'll
do
is
we
put
changes
on
next
and
then
we'll
ask
engineers
to
manually
check
their
changes.
B: This is another way of saying that before the rubber really hits the road, it's kind of like scraping the road first: you can check and understand if anything's going wrong. So you deploy to next, and this is the point where Yoda steps in. We created a little bot named Yoda, who hangs out in Slack and tells you how the status of the changes is going.
B: The first thing Yoda will do is tell you that the changes are on next. It'll list all the PRs that are on next and @-mention each user to inform them that it's time to check their changes. Then each of them will come in and tell Yoda the status of their change: they'll say they're checking, or they'll say it's good; occasionally they'll say it's bad, in which case we'll have to deal with that. So this is kind of the usual flow of how things go.
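The check-in announcement can be pictured as something like the following. The message format and data shape here are illustrative inventions; the real Yoda's wording differs.

```ruby
# Compose Yoda's "changes are on next" announcement from the merged PRs.
def next_announcement(prs)
  lines = ['The following changes are on next:']
  prs.each do |pr|
    lines << "  ##{pr[:number]} #{pr[:title]} (@#{pr[:author]}, please check your change)"
  end
  lines.join("\n")
end

prs = [
  { number: 421, title: 'Show altered reservation dates', author: 'john' },
  { number: 425, title: 'Fix search ranking typo',        author: 'freddie' },
]
puts next_announcement(prs)
```

A bot would then post this string into the team's Slack channel, and each later "good" / "bad" reply updates the per-PR state.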
B
You
can
see
that
in
this
case
we
have
a
couple
users
who
haven't
checked
in
yet,
if
they're
not
available,
you
can
poke
the
reviewers
as
well
will
like
parse
through
the
PR
comments
to
see
what
users
were
involved,
but
users
signed
off
and
will
ping
them
as
well
cool
after
everything
is
good.
Well,
everything
is
good,
Yoda
will
say:
hey
Freddie
you
can
deploy
to
production
and
Freddie
can't
deploy
to
production
using
deploy
board,
deploy
board
updates
in
real
time.
We
have
this
kind
of
deploy
counter.
B
It
shows
you
all
the
machines,
how
many
are
in
a
waiting
state,
how
many
are
ready
to
deploy
and
wish
machines
are
deploying
at
the
time.
If
you
need
to
drill
down,
we
give
you
the
ability
to
list
all
the
machine
role
and
click
through
to
see
the
log
of
the
deploy.
So
it's
all
accessible
right
there
in
the
UI
really
easy
to
use.
B
You
don't
have
to
SSH
into
any
machines,
you
don't
have
to
do
any
like
Grippin
or
anything,
it's
all
right
there,
all
right
and
then
after
it's
done,
deploying
you'll
see
that
you
know
we
mentioned
the
users
again
just
to
let
them
know
what's
going
on,
but
we
also
linked
to
some
key
stuff.
We
link
to
our
new
relic
dashboard.
We
link
to
a
bunch
of
dashboards
with
business
metrics
so
that
people
can
monitor
and
quickly
see
what's
going
on.
So
what
we're
doing
here
is
we're
automating,
a
lot
of
the
communication
and
knowledge
sharing.
B
That's
really
important
when
you're
releasing
software,
we're
giving
you
like
we're
putting
stuff
right
in
your
face,
so
you
can
understand
whether
your
deploy
succeeded
and
how
it's
affecting
users,
but
then
we
have
to
consider
the
question
what
if
something
goes
wrong?
This
is
something
released.
Manager
usually
takes
care
of,
and
you
want
to
be
able
to
do
something
in
that
situation
we
don't
have
a
central
release
manager,
so
we
need
someone
to
help
out
when
something
goes
wrong
like
this
poor
girl.
So
the
first
thing
we
do
is
we
have
automated
alert
notifications.
B: If the error rate increases by too much (here it's gone up more than three hundred percent), we @-mention everyone in the channel and say: hey, this is a problem, we need to roll back, or at least investigate what's going on. We make rolling back easy. So here you can see that the @-mention has brought a lot of people into the room, and we have a picture, thanks to Ben, of the error rate rising.
B: So how do we roll back? We make rolling back really, really easy; it's right there in the UI once again. Remember this photo of Deploy Board: these are the buttons that are right there in the deploy progress bar. If you click roll back, you'll get a modal that explains what's going to happen.
B: It explains what you're going to roll back to: you can see what the change set is, and you can see what the SHA is. That'll just abort the current deploy and immediately begin a new deploy of what was previously on production. If it's a more tricky situation, you may just want to stop the deploy midway and leave some of the boxes in a kind of half-done state; it depends on the situation. We give developers the power to use their judgment and make decisions. So here's the abort modal; it's pretty similar.
B: It gives you a little bit of information explaining when you want to abort versus when you want to roll back. Then, afterward, what we want to do is lock deploys if master is in a dirty state. So here we have the ability to set a lock on deploys for just that application, and we say: hey, master contains a change that breaks search; if you need to talk to us about it, come into Slack, we'll explain what's going on, and we can figure the situation out together.
B: That just pings all the open pull requests and sets their status to red, so everyone knows you can't deploy right now. So again, we're automating communication about what the status of the service is, what's going on; we're taking a lot of operational knowledge and just broadcasting it, so that everyone has this knowledge. You can also set deploy locks for other reasons. One thing that we do often is lock if we're going to be on the news; for instance, when we were on Ellen DeGeneres, which was probably the biggest event we had recently.
A: We talked about how the process of release management is really a process of communication: communicating with all the people who are involved in making the code and shipping the code, and we've kind of automated that away using the tools we've written. We also talked about how to recover from mistakes, which is another thing that release managers often do. So you might be wondering: how does all this work under the hood? This is kind of the flow of the system.
A: People want to use webhooks to make automated systems and integrate with GitHub Enterprise, but it can be kind of a pain to have to set up individual webhooks for all the different things you want to write. So we have a single webhook that is installed automatically on all of our repositories, and that webhook sends all events from GHE to a single publisher, and that publisher turns those webhooks into RabbitMQ events. Then Deploy Board and other systems subscribe to those RabbitMQ events.
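One way to picture the fan-out is deriving a routing key from each webhook delivery and publishing the payload on that key, so listeners can subscribe with patterns like `github.push.#`. The key scheme below is our invention (the event names follow GitHub's `X-GitHub-Event` header); the real publisher would hand the JSON to a RabbitMQ client such as Bunny.

```ruby
# Turn a GitHub Enterprise webhook delivery into a RabbitMQ routing key.
# A listener interested in pushes to any repo could then bind 'github.push.#'.
def routing_key(event_name, payload)
  repo = payload.dig('repository', 'full_name') || 'unknown'
  "github.#{event_name}.#{repo.tr('/', '.')}"
end

payload = { 'repository' => { 'full_name' => 'airbnb/deployboard' } }
puts routing_key('push', payload)  # prints: github.push.airbnb.deployboard

# The publisher side would then do something along the lines of:
#   exchange.publish(payload.to_json, routing_key: routing_key('push', payload))
```

Topic-exchange routing keys like this let each listener (builds, deploys, review bot) pick out just the event stream it cares about.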
A
That's
basically
the
stream
of
all
things
happening
in
get
land,
and
this
is
really
useful
because
you
can
see
the
stream
of
all
different
kinds
of
events
like
everything,
from
new
code
being
pushed
and
new
benches
being
created
to
comments
being
lost
on
PO
requests,
and
some
of
automations
rely
on
that
stuff
as
well.
So
under
the
hood,
deploy
board
itself
is
actually
composed
of
several
different
RabbitMQ
listeners,
each
of
which
performs
different
functions.
So
some
of
these
dispatch
our
build
system,
events
like
our
CI
events.
A: So this is, I think I skipped a slide. Okay, so under the hood, all of this work is implemented for us as Resque jobs; we're a big Rails shop, and we use Resque. This is also the system that GitHub itself uses for its delayed jobs. This is an example of our basic build job. It's kind of the thing that we use to build many of our projects that don't require anything particularly special. We find that a lot of times people write their build systems in shell scripts.
A: You use some kind of job dispatcher, and the job dispatcher is often Jenkins or something, and then in order to drive Jenkins you write a bunch of shell scripts, which nobody can read or maintain, because they're shell scripts. So we find that Ruby is a lot easier to use for this, and Resque is just as good a dispatching system as Jenkins is, if not better in many ways. So this is a basic build job: it will clone the repo and then install some dependencies.
A: It'll do an npm install and a bundle install, and then, if it's a master build, it will create a build artifact including all those installed dependencies and ship it, so that it's available for the deploy system to actually deploy. The build jobs kind of stack on top of each other. So this is our Rails build job: it's basically a subclass of the basic build job, but in addition (thanks, Topher) it also does a couple of things that are Rails-specific.
A: In particular, it compiles assets for Rails, packages them up, and uploads them to S3, if we use an asset host for the project. And I'm just going to try to keep clicking and hopefully it'll just work, okay.
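Put together, the layering they describe might look like this. This is a hypothetical reconstruction: the class names and exact commands are ours, and the real jobs shell out to run each step rather than returning the list.

```ruby
# Resque job for a plain build: clone, install dependencies, ship artifact.
class BasicBuildJob
  @queue = :builds  # Resque dispatches on this queue name

  # Returns the shell steps the job would run; a real perform would
  # execute each with `system(cmd) or raise`.
  def self.steps(repo_url)
    [
      "git clone #{repo_url} build",
      'npm install',
      'bundle install --deployment',
    ]
  end
end

# Rails builds stack on top of the basic job and add Rails-specific steps:
# compiling assets and uploading them to the asset host on S3.
class RailsBuildJob < BasicBuildJob
  def self.steps(repo_url)
    super + [
      'bundle exec rake assets:precompile',
      'aws s3 sync public/assets s3://example-assets-bucket/',
    ]
  end
end

puts RailsBuildJob.steps('git@ghe.example.com:org/app.git')
```

Because the jobs are plain Ruby classes, the Rails-specific behavior is just a subclass calling `super`, which is the stacking the talk describes.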
So how does Deploy Board know what all the different tasks are that it should do when some event comes in from RabbitMQ? Suppose you get an event that new code has been pushed to a repository: what happens next?
A: Okay, so this is a screenshot of our deploy board apps repo. This is the JSON that engineers write to configure their repositories. Suppose you're a new engineer, or, let's say, an experienced engineer with a new project, some new service you want to stand up, and you want that service to get the regular CI flow and the regular deploy flow: you write one of these. It's just a JSON config. This one is the JSON config for Deploy Board itself.
A: We use Deploy Board to deploy itself, so it's just like any other project. You give us the title of the project, and you list its repository URL, which is how we know that this is the JSON config corresponding to incoming events: the repository URL matches. And you specify some additional things, like posting notifications about this app in this channel. Then we have a couple of CI events listed here, like the Rails build job that you saw earlier.
A: We use Solano to run the unit tests for this app, so it also dispatches those from that same place. And then, finally, at the bottom, you see this collection of targets. These are the things you actually deploy to, and you can specify what the different deploy targets are and what the specific roles of the machines involved in that deploy are. For Deploy Board that would be some web workers, some cron workers, and some RabbitMQ workers, which do the stuff that we're looking at.
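A config along those lines might look like this; the field names here are illustrative, not the actual Deploy Board schema.

```json
{
  "title": "deployboard",
  "repository": "git@ghe.example.com:airbnb/deployboard.git",
  "notifications": { "slack_channel": "#deployboard" },
  "ci": [
    { "event": "push", "job": "RailsBuildJob" },
    { "event": "push", "job": "SolanoUnitTests" }
  ],
  "targets": [
    {
      "name": "production",
      "roles": ["web-worker", "cron-worker", "rabbitmq-worker"]
    }
  ]
}
```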
So people are pretty pleased with this. We would like to encourage people to write more services over time, and people have been writing more Deploy Board apps over time: we now have about 170 different Deploy Board apps, and they all have the same kind of workflow, not necessarily always as complicated. Obviously, if you're working on a smaller app, it's a little bit easier to deploy, and you have fewer people contributing, so there's less coordination; but for something like our monolithic Rails app, which still has a lot of code, this kind of coordination stuff is useful. Cool.
B: So one thing we want to do is start doing canary deploys: the ability to take a branch and put it on a subset of the production machines, so they can receive production traffic and developers can see what the diff in error rate is, or play around with actual production data to see how their change is working out. We have a lot of the mechanics for this already on the back end.
B
One
thing
we
want
thought
into
what
the
UI
looks
like
developers
to
see
their
favorite
branches,
understand:
they're,
like
work-in-progress
branches
versus
the
branches
that
are
ready
to
ship
and
to
create
a
UI
that
easily
lets
them
see
what
the
difference
is
between
the
error
rates
of
production
and
the
canary.
Another
thing
we're
doing
is
core
is
expanding
the
alert
automation
for
metrics,
so
the
alert
that
we
show
prior
was
the
new
relic
error
rate
and
notify
everyone
to
channel.
B
You
know:
that's
okay,
we'd
like
to
be
able
to
alert
on
more
specific
metrics,
including
business
metrics,
so
conversion
metrics
for
the
core
booking
flow
Facebook,
sharing
payments
by
country
and
under
or
like
trust
and
safety,
related
metrics,
and
be
able
to
alert
specific
teams
or
even
be
able
to
highlight
certain
individuals
if
we
know
that
they
recently
altered
code
that
is
related
to
this
metric.
So
you
know
right
now
it's
kind
of
a
blast
everyone,
but
we
want
to
make
sure
everything
everyone.
B: Cool. So I talked a little bit before about making Deploy Board more personalized, with seeing your favorite branches. Earlier we showed that there were just builds listed and deploys listed, and actually you don't need to see all that information; that was kind of a naive approach to the UI that we took just to make things simple in the beginning. But now that we're building every single pull request, you're not going to deploy every single one of those, so it might not be useful to see them all.
B
We
want
to
give
you
a
more
particular
view
of
stuff,
that's
relevant
to
your
team
stuff,
that's
relevant
to
you
at
the
very
moment
in
regards
to
understanding
the
recent
history
of
deploys
recent
walks
and
so
on,
maybe
even
upcoming
locks
that
are
going
to
affect
your
workflow.
Finally,
we
want
to
do
better
data
gathering.
We
want
to
prompt
people
during
rollbacks
to
do
the
right
thing.
We
want
to
give
them
information
about.
You
know
what
apps
are
historically
risky.
A: I want to talk about this. One thing that we found is that a lot of companies probably build something kind of similar to this. I know that a lot of big companies build tools that ship their code out; we talk to people at other companies, and they've all built some kind of homegrown deploy flow. And this is really not the GitHub way, right? It's silly that we're all building the same tools over and over again.
A
So
we
would
really
love
to
open
source
deploy
board
and
make
it
available
for,
like
other
people
to
use,
know
the
companies
and
also
to
like
improve
the
project
arm
and
right
now.
The
project
is
like
a
little
bit,
you're
being
be
specific,
it
kind
of
it
it
like
consumes
a
lot
of
things
that
depend
on
the
Airbnb
environment,
and
so
it's
like
not
really,
it's
not
necessarily
a
trivial
project
to
factor
it
out
and
make
it
a
standalone
project.
A: But one thing that would be helpful is if we received a lot of feedback from people who hear talks like this that they want to use a tool like this, that it would be helpful for them. So if that's something that's interesting for you, and especially if you want to try something like this at your company, maybe even before it's completely open sourced and put on public GitHub, maybe we can try to figure out how to install it in your environment. Okay, so Topher is going to finish things up.
B: To review what we've gone over: we don't want release managers, we want everyone to release their own code. We think there are practical benefits to this, and primarily we think there are huge cultural benefits: making developers happy and giving them a lot of control over their own workflow. If we look at the list of things release managers do, there was all this arcane knowledge: how to deploy, how to deal with bad changes, how to understand if something's bad. We replaced a lot of that with tooling, so you just click a button to do most things. We have some automation around communication and around alerting, and when things go really south, we have locks to help communicate service status and prevent people from making the situation worse.
So one thing is, you know, we're doubling every year, and as the team scales, it's very tempting to try to put a lid on the chaos by centralizing controls: everyone has a certain probability of breaking things, so we need to funnel everything through a knowledgeable expert, or a team of experts, in order to prevent problems from happening. But you don't have to do that. If you think carefully about what you're trying to do and invest in the right tooling for you and your team, you can allow people to run free, given the right tools, helping them be prepared for the challenges they're going to face. So that's the message we want to send out today. Thanks for listening. Does anyone have questions?
A: Yeah, that's right. So the question is: how do you deal with auditing requirements, like what happens when the code is particularly sensitive or something like that? I think the controls for that can be enforced at the repository level. In general, if you can push to the repository, then the game is over, right? It's not about deploys, it's about the code itself. In general, we keep all repositories internally public.
A
Like
generally,
anyone
can
contribute
to
any
repository
for
some
of
the
more
sensitive
stuff
that
we
do
on,
for
instance,
our
payments
infrastructure.
Those
teams
are
starting
to
lock
down
some
repositories
that
contain
the
core
code,
and
only
a
few
people
can
push
those
repositories
and
by
enforcing
the
controls
and
pay
her
like
you
know
like
they
have
access
to
the
same
tools,
the
same
workflows.
If,
like
it
like
a
you
know,
the
build
will
be
generated
for
the
payments
repository
if
a
code
event
happens
not
repository
and
then
those
builds
are
like
signed.
A
And
you
know
those
are
the
only
things
you
can
you
can
deploy,
but
yeah
I
think
the
controls
are
around
the
code
itself.
Does
that
does
that
answer
your
question.
A
Oh,
we
don't
that's
right,
we
don't
have
that.
No
I
mean
we
assume
that,
like
there
are
places
to
hook
that
into
right,
like
at
the
point
where
you
are
allowed
to
click
the
merge
button.
That's
the
point
where
we
would
say
something
like
you
can't
hit
the
merge
button
unless,
like
the
appropriate
people,
have
reviewed
your
code
or
like
only
some
people.
This
is
why
we're
really
excited
about
in
protected
branches
and
mandatory
statuses.
This
is
like
a
huge
feature
for
us
super
excited
about
that.
B: Sure. So this talk is about making changes to code, and the question is about making changes to data: for instance, altering the schema of a database, or making an alteration to a data set that might break halfway through. So schema migrations are kind of a sensitive issue; they can threaten stability quite a bit, so right now we have more of a cultural system around making sure that schema migrations are safe. When people have a migration they want, they'll write a proposal for it.
B
They
like
actually
write
out
what
the
alter
table
statement
is
going
to
look
like,
and
they
send
it
to
a
team
of
people
on
our
production
infrastructure
team
who
have
expertise
in
my
sequel
and
they
get
feedback
on
that
oftentimes.
In
that
case,
for
really
sensitive
schema,
migrations,
encore
tables.
B
what we'll often do is wait a little bit and collect the several proposals people have, and then we'll batch them all together into one ALTER TABLE statement, so that we don't have crazy stuff going on all the time. But in general we have lots of little schema migrations going on all the time; the process is fairly lightweight, and most things that aren't threatening will pass review very quickly.
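The batching step just described, several proposals folded into one statement, might look like this in miniature. The table and clauses are illustrative only, and a real tool would also validate each clause before combining:

```ruby
# Combine several proposed clauses for the same table into a single
# ALTER TABLE, so a large MySQL table is rebuilt once instead of N times.
def combine_alters(table, clauses)
  "ALTER TABLE #{table} #{clauses.join(', ')};"
end
```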
In regards to changes to services that serve data:
B
well, the deploy there varies a little bit. Not all the services that do that kind of thing use deploy board for that kind of data release process. Those that do generally have the ability to just start from scratch; I guess you could call it idempotent. If you're going to release a data set, you release it.
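One common way to get the "start from scratch" property mentioned here is to write each release to a fresh versioned location and then flip a pointer. This is a sketch of that general pattern under assumptions, not a description of Airbnb's tooling; the in-memory hash stands in for real storage:

```ruby
class DataReleaser
  attr_reader :current

  def initialize
    @store   = {}   # versioned location => data
    @current = nil  # pointer to the live version
  end

  # Re-running a release for the same version just overwrites the same
  # location and flips the pointer again, so a half-finished attempt can
  # be retried safely from scratch.
  def release(version, data)
    location = "datasets/#{version}"
    @store[location] = data
    @current = location
  end

  def live_data
    @store[@current]
  end
end
```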
A
Right. If I'm deploying code and there are, like, three other people with code in my deploy, then all those people are jointly responsible for that deploy going out successfully. And if the deploy doesn't go successfully, they start getting pinged by the automated systems, and the automated systems say the error rate has increased, or the business metrics are dropping, or something like that. Then those people will roll back the code.
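A minimal sketch of such an automated nudge, comparing post-deploy measurements against a pre-deploy baseline. The thresholds and the bookings metric are invented for illustration:

```ruby
# Decide whether the deployers should be prompted to roll back, based on
# how the current measurements compare with the pre-deploy baseline.
def should_rollback?(baseline, current,
                     max_error_increase: 0.5, max_metric_drop: 0.1)
  errors_up   = current[:error_rate] > baseline[:error_rate] * (1 + max_error_increase)
  metric_down = current[:bookings]   < baseline[:bookings]   * (1 - max_metric_drop)
  errors_up || metric_down
end
```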
A
They'll be prodded by the system to roll back, and then they have time to investigate, so they'll be able to figure out: oh, what exactly broke, how do I revert that change? That's the job of the people who happen to be managing that particular release. So then the question is maybe beyond that: suppose we're always breaking code for the same reason, every single day; people cause the same basic incident over and over, or something.
A
What's the longer-term solution beyond just reverting the code and fixing it? Well, this is not really part of the deploy process, so it's not really handled by the same tool. But we do have a post-mortem tool: if people break production in a severe enough way, they'll often write a post-mortem that explains, this is what happened, and this is why it was difficult to catch.
A
It's a tricky issue, and here are some remediation steps we can take to make sure this doesn't happen over and over again. And then either they, or the teams that are interested in that stuff, will take it on. So, for instance, if we see that one reason we're breaking the site over and over again is that some front-end pieces are not tested or something, then...
A
The first step is that we need to produce builds for branches as well as master, and that's the easy step. Then we have to build kind of a UI that lets you discover the different builds for different branches; right now our UI is specifically around master. And then, once you do have a build, you have a deploy button.
A
Where do you actually deploy that? Our plan right now is to spin up several canary environments. So, for instance, for monorail, which has lots of deploys, there might be a backlog, a traffic jam, to use those environments, so we'll spin up five or six of them or something like that, and then you'll be able to deploy to one of those. The deploy system will make sure that you can't deploy a build that's too old:
A
a build that's behind master; it has to be ahead of master. And then that environment will be locked to you automatically for some period of time, and after that period of time is over, we'll just turn the environment off. In the meantime, during that period, you get to see how the error rates or whatever differ between the environments, because that environment will receive production traffic as well, and you'll also be able to go there in your browser under a separate name.
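The canary rules described, the build must be ahead of master and the environment is locked to the deployer for a window before being torn down, could be sketched like this. The lock window and the commit-list ancestor check are simplifications of what a real git-aware deploy system would do:

```ruby
CANARY_LOCK_SECONDS = 30 * 60 # assumed window, not Airbnb's actual value

class CanaryEnv
  def initialize
    @lock_owner = nil
    @lock_until = nil
  end

  # build_commits: commits reachable from the build (a stand-in for a
  # real git ancestry check); master_head: current tip of master.
  def deploy(build_commits, master_head, user, now: Time.now)
    unless build_commits.include?(master_head)
      raise ArgumentError, "build is behind master; rebase and rebuild"
    end
    if @lock_owner && @lock_owner != user && now < @lock_until
      raise ArgumentError, "environment locked by #{@lock_owner}"
    end
    @lock_owner = user
    @lock_until = now + CANARY_LOCK_SECONDS
    :deployed
  end
end
```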
A
This is kind of a deep infrastructural question. The real reason is that, for us, it takes probably around 10 minutes or so to spin up a new instance, because we run Chef from scratch on a blank-slate instance, and that takes some time. So we just want to make that process fast; we want to have the boxes ready to go. That's kind of why we do the deploy process
A
the way we do, anyway. A lot of teams, you know, companies, might spin up a whole new set of workers for a deploy, but that would take too long for us; we just want to reuse the same instances. This might be optimized at some point in the future, but for now the system is working okay. All right.
B
How are we... yeah, so, okay, wait: the question is how we use New Relic to measure error rates.
B
Yeah, maybe I misunderstood the question. So the way we use New Relic is: they have this gem that works really well for Rails applications. If you just put it in your Gemfile and, you know, add some configuration (they have a nice default configuration file), put in your license key, maybe tune things a little bit, it'll just automatically report errors to New Relic.
B
Oh, I see, just exceptions. So any request that five-hundreds to a user because of some exception the code raises goes to New Relic, so those are actual errors that the user encounters. We also have a separate exception tracker, with more detailed backtraces, that lets you filter by controller and action and all that stuff.
B
We also include errors in the exception tracker that the code rescues: you can manually send an error to the exception tracker just so you have more information to debug with, without impacting the user experience. But if the New Relic error rate itself goes up, that means we're seeing a significant change in the user experience.
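The rescue-and-report pattern just described looks roughly like this. The tracker object is a stand-in for illustration (with New Relic's Ruby agent, the equivalent manual call is NewRelic::Agent.notice_error), and the failing service is simulated:

```ruby
# A minimal stand-in for an exception tracker client.
class Tracker
  attr_reader :reported

  def initialize
    @reported = []
  end

  def notice_error(error)
    @reported << error
  end
end

# Rescue a failure from a non-critical dependency, report it so there is
# context to debug with later, and degrade gracefully instead of 500ing.
def fetch_recommendations(tracker)
  raise "recommendation service timed out" # simulated flaky dependency
rescue => e
  tracker.notice_error(e)
  [] # the user just sees no recommendations
end
```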
So it's well worth alerting on. Cool, thanks a lot. We'll be here afterwards if you want to come and talk to us some more. Thank you.