From YouTube: SF JAM Hosted Jenkins @ Google
Description
Scott Zhu talks about Kokoro, the internally hosted Jenkins service at Google.
A
I'm the manager for the team at Google that's doing internally hosted Jenkins. We've been working for about the last year on bringing Jenkins into Google and hosting it for our developers. It's replacing a prior system which was proprietary and closed source, and this is a little bit more of the story of our journey working to onboard teams onto Jenkins.
A
There's a prior presentation by my previous TL, which is up on YouTube, documenting some of the contributions we made to core and remoting to help improve the scalability of Jenkins, because we thought that was really important for us. Scott is my TL; he's going to be talking about our journey over the last year, our experience hosting Jenkins, and some of the stuff we've done.
B
Hi, I'm Scott. I'm a software engineer at Google, and I'm currently the TL for the hosted Jenkins team. Within Google we call our instance Kokoro. When we decided on a name, we had a small vote for it. Our previous tool is called Pulse, and that name is related to heart and heartbeat, so we thought we'd give the new one a code name that's also related to heart. "Kokoro" is the Japanese word for heart.
B
So that's how we got the code name. Okay, so today I will talk a little bit about the scale of the hosted Jenkins instance within Google, some of the work we have done to make it more stable, reliable, and scalable, and also some of our learnings and takeaways.
B
We want to consolidate all of this together and give teams a better experience. Part of the reason is to deprecate the old tool, and also to avoid ad hoc, self-hosted usage of Jenkins. Currently, as I said, we run about 1k builds per day on about 100-ish agents for Windows and Linux, and there are 200 projects and 200 daily active users.
B
You might wonder, since Jenkins already supports Windows and Linux, why we needed extra work there. We take a different approach to hosting slaves — sorry, the Jenkins agents; I keep forgetting not to say slaves. We want a sandboxed experience, so we run all those Windows and Linux agents as VMs, and we actually run the agent binary outside of the VM itself.
B
So it's kind of like a paired instance: whatever build is queued for you runs within the Windows or Linux VM, so it's more like a sandbox, and we actually reboot the Windows or Linux VM afterwards so that it's a clean environment for the next build. That way you can guarantee the build is reproducible and hermetic. We put a fair amount of work into supporting the Windows and Linux flavors of builds.
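
The talk doesn't include code for this, but as a rough illustration of the "agent binary outside the VM" pattern, here is a minimal Python sketch, assuming a plain SSH-reachable build VM (the host name and build command are hypothetical stand-ins for Kokoro's internal VM tooling):

```python
import subprocess

VM_HOST = "build-vm-01"  # hypothetical VM; Kokoro's real VM manager is internal

def run_sandboxed_build(build_cmd: str) -> int:
    """Run one build inside the VM over SSH, then reboot the VM.

    The agent process itself lives outside the VM, so a misbehaving
    build can only damage the disposable VM image, not the agent.
    """
    # Execute the build inside the sandbox VM.
    result = subprocess.run(["ssh", VM_HOST, build_cmd])

    # Reboot the VM so the next build starts from a clean, hermetic state.
    subprocess.run(["ssh", VM_HOST, "sudo reboot"], check=False)
    return result.returncode

if __name__ == "__main__":
    raise SystemExit(run_sandboxed_build("make -C /workspace all"))
```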
B
We're also trying to support macOS and iOS builds, and that's currently our first priority. We have teams who want to use our tools; they come to us with various requirements and use cases, and we discover new ones quite constantly. We expect that maybe later this year we can hit something like 10k builds per day.
B
Okay, so the work we have done. There are parts of the work we invested in, like automating the Jenkins failover, that really benefited us from the beginning, so that we can have a scalable and reliable instance, and also work like project configuration in source, which we invested a lot in. During that time, I think, Pipeline wasn't that mature.
B
Okay, jumping to the build config. For project config in source, I think there are already existing plugins and approaches to do that. It's kind of like my favorite xkcd: there are 14 competing standards, and it's ridiculous — we should invent one for ourselves to rule them all — and now we have 15 standards. So there are existing options like YAML and the Pipeline DSLs.
B
From our perspective, we wanted certain features. Currently I list them as issues, but I would say they're features we really wanted to have. Within Google we have a giant single code base — the one we're using right now is code-named Piper; you can search for it online — and we want projects to put their configs in that same repository. Copying shared configs between projects is painful and doesn't scale.
B
So we want projects to be able to share their configs with each other. We also split the project config and the build config into two places, because we want the build configuration — your build steps — pinned at a certain version rather than always read from head. That way, for example, if you want to build an old version of your software, you still can, rather than reading all your config from head, which may break your build. Okay, so the builds, as I said, are generated at runtime.
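
To make the "pinned config" idea concrete: a minimal sketch, assuming a Git checkout (the real system reads from Piper, and the config path here is hypothetical). The point is that the build config is read at the revision being built, not at head:

```python
import subprocess

def read_build_config(repo_dir: str, revision: str,
                      path: str = "ci/build.cfg") -> str:
    """Read the build config exactly as it existed at `revision`.

    Reading at the build's own revision (instead of head) keeps old
    releases buildable even after the config has changed at head.
    """
    return subprocess.run(
        ["git", "-C", repo_dir, "show", f"{revision}:{path}"],
        capture_output=True, text=True, check=True,
    ).stdout

# Example: rebuild release v1.2 with the config it shipped with.
# config = read_build_config("/src/myproject", "v1.2")
```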
B
Probably that applies to a lot of you, but we think it's very beneficial for us, because our instance is kind of big and we don't want people randomly coming to the Jenkins UI, clicking buttons and changing stuff. What we do is completely disable the UI for project config. Any time you want to change a project config, you have to create a changelist and send it out for review; after people OK it, the change gets committed, and then your config change takes effect. That way we can actually track every change that gets made to a project, and the scenario where a change randomly breaks someone else's project can be avoided. Also, since you cannot create or modify the config yourself, we have a service that automatically listens for commits of project configs and then creates, modifies, and deletes the projects for you.
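
The commit-listener service itself is internal, but the Jenkins side of such a sync can be done with Jenkins' standard REST endpoints (createItem, config.xml, doDelete). A minimal sketch, assuming a hypothetical `changed_configs()` feed that yields (job name, config.xml body, action) tuples from the commit stream; the URL and credentials are placeholders:

```python
import requests

JENKINS = "https://jenkins.example.com"   # hypothetical URL
AUTH = ("sync-bot", "api-token")          # hypothetical service account
XML = {"Content-Type": "application/xml"}

def sync_job(name: str, config_xml: str, action: str) -> None:
    """Mirror one committed project config into Jenkins via its REST API."""
    if action == "create":
        requests.post(f"{JENKINS}/createItem", params={"name": name},
                      data=config_xml, headers=XML, auth=AUTH).raise_for_status()
    elif action == "modify":
        requests.post(f"{JENKINS}/job/{name}/config.xml",
                      data=config_xml, headers=XML, auth=AUTH).raise_for_status()
    elif action == "delete":
        requests.post(f"{JENKINS}/job/{name}/doDelete",
                      auth=AUTH).raise_for_status()

# for name, xml, action in changed_configs():  # from the commit listener
#     sync_job(name, xml, action)
```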
B
Okay, another piece of work I actually like a lot: reducing the master's workload. At the very beginning, when we started the project, we were thinking: we're Google, and we run stuff at scale.
B
We saw the backup size accumulate and increase very quickly, and that increased our backup and restore times until it became a problem. So the first thing we did was use external log services. Normally, every single agent, when it starts up, streams all its logs directly back to the master, and the master saves them somewhere; unless your project config says how much build history to keep, they stay there forever.
B
What we did is redirect all the agent logs directly to external log services, so we don't save those logs on the master at all. The log goes straight to the logging service; it never goes back to the master. The master only keeps a reference to the log entry, and that reduces the backup data, the traffic on the master, and the load on the master a lot.
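
The internal log service isn't public; as a rough analogue, here is a minimal sketch using Google Cloud Storage (the bucket name is hypothetical). The agent uploads the build log itself and hands the master only a short reference:

```python
from google.cloud import storage

BUCKET = "kokoro-build-logs"  # hypothetical bucket

def archive_build_log(build_id: str, log_path: str) -> str:
    """Upload a finished build log and return the reference the master keeps.

    The master stores only this short URI, so its backups stay small no
    matter how large or numerous the logs are.
    """
    blob = storage.Client().bucket(BUCKET).blob(f"logs/{build_id}.txt")
    blob.upload_from_filename(log_path)
    return f"gs://{BUCKET}/logs/{build_id}.txt"
```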
B
Some other aspects, like SCM polling, we also don't do on the master; we have an external service handle that. For example, for git changes — I'm not sure about the current git plugin, but my guess is it periodically polls the git repository to see if any changes came in. We have an external service to handle that instead, plus an API interface, so any time that external service sees a change, it can trigger a build on our master, and the master will kick off the build for you. That also reduced the master's traffic and load.
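
Stock Jenkins already exposes a remote-trigger endpoint, so an external poller only needs something like the following sketch (the job name, parameter, and credentials are hypothetical):

```python
import requests

JENKINS = "https://jenkins.example.com"  # hypothetical URL
AUTH = ("poller-bot", "api-token")       # hypothetical service account

def trigger_build(job: str, revision: str) -> None:
    """Ask the master to start a build once the poller sees a new commit."""
    requests.post(
        f"{JENKINS}/job/{job}/buildWithParameters",
        params={"REVISION": revision},
        auth=AUTH,
    ).raise_for_status()

# trigger_build("drive-sync-linux", "deadbeef")  # called by the change listener
```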
B
We also save artifacts directly to external storage, which saves a lot of space on the master itself. If you package a lot of binaries on the master, sometimes you want to keep them for a fairly long time, because you want to compare against old versions, and we have another storage system inside Google which keeps those binaries for you.
B
So when we hit this case, we were thinking: we should probably have at least a warm standby sitting there, watching whether the live master is still up; otherwise it should quickly grab the mastership and start serving users. We rely on the Google-internal master election service for this, and we have three shards there.
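
The master election service is internal to Google, but the same warm-standby pattern can be sketched with ZooKeeper's leader-election recipe via the kazoo library (the hosts, path, and identifier are hypothetical):

```python
from kazoo.client import KazooClient

def serve_as_master() -> None:
    """Placeholder for restoring the backup and starting to serve users."""
    print("Won the election: restoring backup and starting Jenkins...")

zk = KazooClient(hosts="zk1.example.com:2181")  # hypothetical ensemble
zk.start()

# Every shard runs this; whoever holds the lock is the live master,
# the others block here as warm standbys until the leader dies.
election = zk.Election("/kokoro/master-election", identifier="shard-1")
election.run(serve_as_master)
```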
B
Currently that gives us the benefit that every time the master switches, our downtime is about one minute. During that minute you will not be able to access the Jenkins UI or do anything, but afterwards there is a live server again. In-flight builds, though, will get corrupted, and we actually re-queue them and re-run them.
D
[inaudible question]
B
We actually have that — it will be covered in the next few slides. We have very dynamic agent scaling: the agents bring themselves up, connect to the master, and get authenticated by a certain Google-internal mechanism. So we don't manually configure those agents or make them connect to the master; there's an automatic mechanism for us to do that, and it works at scale.
C
[inaudible question]
B
I think — correct me if I'm wrong, David — the build queue also gets backed up. If there's an unexpected termination of the job, the build itself might get lost, but in certain cases, for example when we deploy anything, we do keep the master's build queue in the backup itself. So when the new standby comes online, it actually has the up-to-date history of the build queue, and then it re-queues all the pending jobs.
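
A minimal sketch of that queue handoff, assuming pending builds can be serialized (all names here are hypothetical; stock Jenkins persists its own queue in queue.xml):

```python
import json

QUEUE_BACKUP = "/backup/pending-builds.json"  # hypothetical backup location

def backup_queue(pending: list[dict]) -> None:
    """Persist the pending-build queue alongside the regular master backup."""
    with open(QUEUE_BACKUP, "w") as f:
        json.dump(pending, f)

def restore_and_requeue(submit) -> None:
    """On the new master, re-queue every build that was pending at failover."""
    with open(QUEUE_BACKUP) as f:
        for build in json.load(f):
            submit(build)  # `submit` is the master's enqueue function
```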
B
Yes, so the master election service always knows who the new master is. For example, the instance in the middle here will try to grab the master lock from the master election service, and then the election service knows: the second instance is the new master now, and anyone who comes asking who the new master is, I should tell them it's the second one.
B
No, the agents actually talk to the master election service. It doesn't use an SSH connection; we have a Google-internal RPC mechanism for that. It's basically not an SSH connection or the old JNLP — it's an RPC procedure. Yes, there is an open-source version of it; you can search for gRPC.
B
Yes, the agent currently has a very small interval before it retries the master. The one-minute downtime is between the two masters during failover: it takes some time for the new master to grab the master backup data, load it, and initialize itself. That's where the one minute goes, and meanwhile the agents periodically retry, asking who the new master is.
B
It's more that the master itself always talks to the master backup data. Currently we are not mounting it as a file system, though in the future we probably could. Right now the master sends the backup data to a storage system, and we copy it back afterwards. That's why we want to reduce the master backup size as much as possible: so we can reduce that copy.
C
But these big design changes are kind of new for us at Jenkins. That was the first big long-lived branch — we had a branch going for six weeks, which is unheard of in the eight years I've been in the project. So at the contributor summit we're going to be talking about the beginnings of those baby steps that we can take.
B
Okay, there's also what I mentioned: we have an external service which automatically manages project creation, editing, and deletion, and we've done work to dynamically make agents come online, register themselves, and be ready to build. So I would say we don't have any agent-management overhead right now. We run agents as VMs, and it's pretty standard and templated: for example, if I want to bring up 100 more agents, I can just change a config file and they will be online in about five minutes.
B
So as long as we have enough resources for that. The tool which enables us to do this is the Google-internal scheduling service, and we run all our jobs on it. There is an open-source version of it called Kubernetes; you could try that and see if it works for you. And as I already mentioned, the agents will automatically reconnect to the master if they lose the connection with the old one, and they talk to the master election service to do that.
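
With Kubernetes standing in for the internal scheduler, "change a config file and get 100 more agents" corresponds to bumping a replica count. A minimal sketch with the official Python client; the deployment name and namespace are hypothetical:

```python
from kubernetes import client, config

def scale_agents(replicas: int) -> None:
    """Resize the pool of templated Jenkins agent pods."""
    config.load_kube_config()
    client.AppsV1Api().patch_namespaced_deployment_scale(
        name="jenkins-agent",   # hypothetical Deployment of templated agents
        namespace="ci",
        body={"spec": {"replicas": replicas}},
    )

# scale_agents(200)  # 100 more agents, online once the scheduler places them
```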
B
So the service is quite stable for us right now. There was only one unplanned outage during the first eight months of the project, and that outage was actually not related to Jenkins itself — it was related to our master election service. So I would say it's quite stable. We have also done some work to support Mac builds. Currently there are some challenges there, and I think this is also true for some of the users here: running a Mac pool, or Mac fleet, is a little bit hard to manage.
B
We have to set up special data centers just to host those Mac machines and run builds from there, and that also prevents us from using that perfect Google scheduling service to schedule them, so we have to do extra work to serve the Mac use case. Currently we have a design which is somewhat similar to our existing use case.
D
[inaudible question]
A
Virtualization — we did experiment with it. With the 2012 Mac minis we made it work for two, but without key expansion capabilities, and the 2014 isn't powerful enough. Who knows, maybe someday. I think Apple did that on purpose, because they were cannibalizing their Pro sales. The price differential between the minis and the Pros is ridiculously steep — such a jump that they have to justify it somehow.
C
[inaudible question]
B
We have a server which hosts base Mac images, and every time a Mac reboots it grabs a fresh image from there, so we don't leave behind any bad bits which could affect the next build. We also turn off some of the services on the Mac itself — for example the sleep/suspension service, and the screen saver mode that freezes and turns off your Mac after 20 minutes.
B
If there's no activity it turns itself off, and you don't have an easy way to wake it up if you don't have physical access to it. We also turn off some of the auto-updates, like Xcode and Java updates. Our sister team had this experience running Macs: one day the builds suddenly break because there's a pop-up for an Xcode update, and it prevents you from doing further builds until someone actually clicks the Yes button.
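
The exact fleet tooling is internal, but the individual knobs are plain macOS commands. A minimal provisioning sketch (run as root on each Mac, applying the settings the talk describes):

```python
import subprocess

# Settings from the talk: no sleep, no screen saver, no auto-updates.
HARDENING_COMMANDS = [
    # Never suspend the machine, even when idle.
    ["systemsetup", "-setcomputersleep", "Never"],
    # Disable the screen saver (idleTime 0 = never start it).
    ["defaults", "-currentHost", "write", "com.apple.screensaver",
     "idleTime", "0"],
    # Stop the software-update pop-ups (e.g. Xcode) that block builds.
    ["softwareupdate", "--schedule", "off"],
]

def harden_mac_agent() -> None:
    """Apply the build-agent settings once per fresh Mac image."""
    for cmd in HARDENING_COMMANDS:
        subprocess.run(cmd, check=True)
```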
B
And that's it. We are part of the developer infrastructure team within Google, and we work on the build and release services for Google; currently we manage the hosted Jenkins instance in Google for different teams. This is John, our manager; that's David; and Shane, who is the new team member joining us. There are also Sabu and Mercy, and Patrick, who is our systems administrator, but they couldn't attend today.
B
Our old existing build tool mainly targeted iOS, Windows, and Linux builds, mostly for client binaries. So the major teams using our service right now are, for example, Google Drive sync — those clients run on Windows, Mac, and Linux — and also some other teams like Google Earth; they have to package binaries and deploy them to users.
A
Anything that comes up on the screen saying "signed by Google" — all of those things fit our use case. So when you're installing something and it says "signed by Google", it came from our service. Now, some of them are still being built by the old system — the Mac ones and the iOS ones — but all the Linux and Windows ones are ours.
B
Within Google, I would say our service is a very small one, only hosting the client binaries. Google has a gigantic CI system just for, like, the web services. John is also a manager of that service; I can't say how large it is, but it's much larger than our service.
B
Since we invented our own config format back when Pipeline wasn't that mature, we still consume one single build script and run maybe one build for you. But we do plan to try to move to Pipeline, because the Jenkins community has spent a lot of effort making that work, and we might integrate with Pipeline in the future.
B
That's right. I think if we feel that our plugin or our work actually fits both the internal and external use cases, we will definitely open-source it. We have already pushed some of our changes and plugins to the open-source world, and I think macOS won't be an exception to that.
A
If we start hosting it for our cloud customers, it will definitely be hosting Macs as well, so they'll just be able to spin one up and have it work — whereas right now in cloud you can't find a Mac to be had. So if you need to build on Mac: I think eventually — I can't say exactly when — we'll have something. I think a lot of people would like that.
B
Yes — on the original author of our RPC slave: we actually have another engineer within Google trying to open-source that. There's an external version, which is like a gRPC slave. I'm not sure about its status, but I do see code reviews coming in for it, so I might check and see whether it's really open-sourced yet.
C
I run ci.jenkins.io, which is the Jenkins project on Jenkins. We recently migrated infrastructure, so I rebuilt the instance. We use LDAP and the matrix authorization strategy; I disabled anybody but admins from being able to configure anything, and the approach we've gone with, using pipelines, is to say:
C
If you provide a Jenkinsfile, then you can have a job on this instance, and thus far that's worked out fairly well. Before — and the project's been going for a long time — having people with access to go change these things, whether on their project or somebody else's, had over time resulted in a lot of untrackable and unauditable changes. So when we did the migration to the new infrastructure, we actually didn't migrate a lot of jobs, because I couldn't figure out who the hell did what, or why.
B
Currently, most of the plugins running on our instance are actually written by ourselves, because we have such a highly customized environment that most off-the-shelf plugins don't just work out of the box. In the process of writing those plugins we take a look and balance between different user requests, but we haven't yet heard a use case where a user came to us and said: I want to use this...