Description
Presented by Tobias Gabriel and Nikolas Krätzschmar, SAP.
GitHub Satellite: A community connected by code
On May 6th, we threw a free virtual event featuring developers working together on the world’s software, announcements from the GitHub team, and inspiring performances by artists who code.
More information: https://githubsatellite.com
Schedule: https://githubsatellite.com/schedule/
A
So, what we have coming up: credential access at large-scale organizations, and we're gonna have two amazing speakers joining us: Tobias Gabriel, developer at SAP. Yeah, you know me: I can't wait for that AI rap battle. Sorry, it's been on my mind all day: 12:40 Pacific, gonna watch it. And Nikolas Krätzschmar, who's a student SAP developer, and they're gonna tell you all about what they've been learning and doing, and how to make Git and GitHub more effective for people that are learning. So take it away, Tobias and Nikolas.
B
Hi, and welcome from my side as well. Today we want to give you a quick overview, in the next 20 minutes, of what we learned from our adventures in credential mitigation at SAP and what things we did there, as this was and still is a topic in motion for us. We will not go into too much detail about the actual findings, but rather focus on the tools and processes we used for that. That being said, let me quickly introduce ourselves.
B
My name is Tobias and I'm a developer in the tools team at SAP, where my main focus is on everything around GitHub: from administrating our internal GitHub servers, over doing Git and GitHub trainings, to working on cross topics like this one. I'm joined today by Niko, who is a master's student in our team and did most of the technical implementation of the scanner; he'll also go into detail about that later.
B
We both work for SAP, which is one of the largest enterprise software vendors in the world and has over 30,000 developers. With that many developers, we also have a quite big codebase: currently we have over 250,000 repositories hosted on our main GitHub server, amounting to roughly five terabytes of compressed source code. With a push towards inner source, meaning that we would like to open up more of these repositories so that every colleague can see them and reuse code, we wanted to make sure that no credentials get leaked there by accident.
B
It
happens
and
probably
already
happened
every
one
of
you
myself,
including,
but
you
by
accident,
commits
on
fire,
push
it
up
and
then
never
for
never
come
back
to
clean
that
up.
So
you
have
some
password
or
some
credentials
leaked
via
and
by
limb
private
repositories,
but
not
critical.
When
opening
up
were
two
organizations
or
whole
enterprises,
we
wanted
to
make
sure
to
reduce
the
risk
bear
as
much
as
possible.
B
However, with that many repositories, our challenge was to first figure out how big of an issue this actually is and how many credentials there really are in our source code. To focus on that, I'm now handing over to Niko, who will go into detail about how we implemented it, what things we had to keep in mind, and what considerations we needed to take from there. Niko, take it away.
C
Thanks, Tobias. I'm Nikolas; welcome everyone from my side as well. OK, first, to define what we were actually scanning for: we opted to limit our search to static patterns that can be easily identified by regular expressions, as this appeared to be sufficient for most types of authentication tokens. For example, all AWS keys always begin with the same leading character sequence, or take Google Cloud certificates.
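As an illustration, patterns of this kind can be expressed as a small rule set. The sketch below is generic, not SAP's production rules: the `AKIA` prefix for AWS access key IDs is well documented, while the other two patterns are simplified assumptions.

```python
import re

# Simplified credential patterns. The AKIA prefix for AWS access key IDs
# is well documented; the other shapes are illustrative approximations.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "slack_webhook": re.compile(
        r"https://hooks\.slack\.com/services/T\w+/B\w+/\w+"
    ),
    "private_key_header": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
}

def scan(text):
    """Return (pattern_name, matched_string) pairs found in a piece of text."""
    return [(name, m.group(0))
            for name, rx in PATTERNS.items()
            for m in rx.finditer(text)]
```

Static prefixes like these are what make a pure-regex approach viable at all: the scanner never has to understand the file, only to recognize the token shape.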
C
Then
there
is
a
pretty
nice
tool
called
get
leaks,
which
uses
a
slightly
more
sophisticated
approach
to
identify
secrets
than
just
plain
regular
expressions
by
also
employing
entropy
measurements
and
now,
while
both
of
these
tools
are
great,
and
we
would
in
fact
highly
encourage
anyone
facing
a
similar
problem
to
give
them
a
try
first,
they
were
simply
not
performant
enough
at
our
scale.
So
instead
we
decided
to
implement
our
own
solution,
drawing
inspiration
from
those
existing
tools,
but
with
a
more
performance,
focused
approach.
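The entropy measurement used by gitleaks-style tools can be sketched as a Shannon entropy score over a candidate string: random-looking tokens score high, while repetitive or natural-language strings score low. This is a generic illustration, not gitleaks' exact implementation, and the threshold value is an assumption.

```python
import math
from collections import Counter

def shannon_entropy(s):
    """Bits of entropy per character of the string s."""
    if not s:
        return 0.0
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def looks_random(s, threshold=4.0):
    """Heuristic: flag strings whose per-character entropy is high.
    The threshold is illustrative and would need tuning in practice."""
    return shannon_entropy(s) >= threshold
```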
C
However, this comes at the cost of producing more output, meaning a substantial amount of additional data to be scanned in the next step. Therefore, whichever of these two options is preferable will highly depend on the regex scanner's throughput capabilities when performing the actual pattern matching. We looked at various regex tools out there, and during some initial research we quickly discovered that standard grep was just not going to be up to the task, primarily due to its lack of multi-line pattern support.

So instead we looked at pcregrep as a mostly comparable alternative that does indeed support multi-line mode, as well as some more complex patterns. With that, we were able to scan all blobs from the hundred test repositories in just over a hundred seconds. In this process we also got some support from GitHub's professional services team, who pointed us towards using Intel's Hyperscan library. It's a high-performance regular expression matching library that works by precompiling patterns and tuning them to a specific CPU's microarchitecture, using vector instructions and some other magic optimizations.
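The multi-line case that rules out standard grep is, for example, a PEM private-key block spanning many lines. A minimal sketch of such a pattern (illustrative, not the speakers' production rule):

```python
import re

# A PEM private key spans multiple lines, so the pattern must match
# across newlines; re.DOTALL makes "." also match newline characters.
PEM_KEY = re.compile(
    r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"
    r".*?"
    r"-----END (?:RSA |EC )?PRIVATE KEY-----",
    re.DOTALL,
)

def find_pem_keys(blob_text):
    """Return every PEM private-key block found in a blob's text."""
    return PEM_KEY.findall(blob_text)
```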
C
Now, looking back at the two options discussed earlier for how to extract the repositories' content, it becomes clear that the scanner throughput is not the limiting factor. Therefore, the option of just outputting all objects, that is, the git cat-file based approach, should be preferable, because the slightly longer time needed to perform pattern matching on the additional data is more than made up for by the time saved by not computing differences between files. Putting this all together, it takes 40 seconds to both extract and scan the contents of all hundred repositories.
C
Okay, with all this in place, we decided to run this on our entire codebase of over 250,000 repositories. At first we considered cloning them to a separate machine to perform the scanning there, like we did for the test repositories, but, as you can imagine, that quickly ran into a couple of problems: primarily how to get access to all the repositories, including the private ones. But more importantly, this approach would essentially be equivalent to spamming our own GitHub instance, and it would also just take too long.
C
With the setup running on 128 worker threads, we were able to perform a full scan of the entire five terabytes of compressed repository data in just four hours, and that left us with a list of findings, each potentially being a leaked secret. Now, with this, I'm handing back to Tobias to talk to you about what we did with those findings, how we post-processed them, and also about some of the non-technical observations we made while rolling out this tool.
B
Yeah, thank you very much, Niko. With the scanner now in place, we actually run the scan on a daily basis, so every day we get a list of potential secrets which match our given patterns. Probably to no one's surprise, we found quite a few more than one; actually so many that we didn't want to manually follow a mitigation process or send around some Excel sheets or things like that.
B
So we now needed to take our findings, our matched patterns, and go into more detail about them: what they are, whether they are indeed valid, and things like that. To show you one example of what we did: Slack webhooks look like this and already contain a secret value at the end, which you can use to post to a specific channel. So you only need this URL and can make an HTTP request, and the message gets sent to the channel without any further authentication.
B
So this is a credential, and you probably don't want it accessible to everyone. While it is not the end of the world if somebody has it, they can still spam your channel, and you probably don't want that. But if you just look at these URLs, say one is valid and the other one is invalid, you can't see which one is actually still valid and should probably be mitigated.
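Validity can only be established by trying the credential out. Here is a sketch of such a check for Slack-style webhooks; the response codes and bodies assumed here (200 with `ok` for a live hook, 404 for a revoked one) are illustrative assumptions, not a documented contract, so treat the classification as a starting point.

```python
import json
import urllib.request
import urllib.error

def classify_response(status, body):
    """Interpret a webhook probe response (illustrative mapping)."""
    if status == 200 and body.strip() == "ok":
        return "valid"
    if status == 404:
        return "revoked"
    return "unknown"

def probe_webhook(url):
    """POST a harmless test message to the hook and classify the result.
    Note this posts a visible message to the channel if the hook is live."""
    data = json.dumps({"text": "credential scan test"}).encode()
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(req) as resp:
            return classify_response(resp.status, resp.read().decode())
    except urllib.error.HTTPError as e:
        return classify_response(e.code, e.read().decode())
```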
B
The same process of verification can also be applied to what we call service credentials, which are credentials for accounts from a central service: cloud accounts, GCP accounts or AWS accounts, or even the Bitcoin keys Niko mentioned earlier. Because Bitcoin keys are Base58 encoded, you can match for them, and if you receive a list of candidate strings, you can just try them out to see if they are valid. We actually found a single Bitcoin key in our codebase. Unfortunately, the corresponding Bitcoin wallet was already empty.
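Checking whether a Base58 candidate is even well-formed can be done entirely offline via the Base58Check checksum, which is the first four bytes of a double SHA-256 over the payload. A self-contained sketch, independent of the speakers' actual tooling:

```python
import hashlib

# The Bitcoin Base58 alphabet: no 0, O, I, or l.
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def b58check_encode(payload: bytes) -> str:
    """Append a 4-byte double-SHA256 checksum and encode in Base58."""
    checksum = hashlib.sha256(hashlib.sha256(payload).digest()).digest()[:4]
    raw = payload + checksum
    n = int.from_bytes(raw, "big")
    out = ""
    while n:
        n, rem = divmod(n, 58)
        out = ALPHABET[rem] + out
    # Leading zero bytes are represented as leading '1' characters.
    pad = len(raw) - len(raw.lstrip(b"\x00"))
    return "1" * pad + out

def b58check_decode(s: str):
    """Return the payload if the checksum is valid, else None."""
    try:
        n = 0
        for ch in s:
            n = n * 58 + ALPHABET.index(ch)
    except ValueError:
        return None  # character outside the Base58 alphabet
    raw = n.to_bytes((n.bit_length() + 7) // 8, "big")
    raw = b"\x00" * (len(s) - len(s.lstrip("1"))) + raw
    payload, checksum = raw[:-4], raw[-4:]
    good = hashlib.sha256(hashlib.sha256(payload).digest()).digest()[:4]
    return payload if checksum == good else None
```

Candidates whose checksum fails can be discarded immediately; only well-formed keys need an (online) balance check.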
B
After having this list of verified credentials, the next step would be to start a mitigation process. To quickly summarize what we have now: we have a scanning process in place which takes the repository data, scans it on a regular basis, and then tries to validate the findings, if they are service credentials, against the central service.
B
However, with that many findings and so many development teams, we didn't want to create Excel lists or manually send emails around, but rather opted to implement a full service for that. We call it the audit service; it takes in these findings and then notifies the responsible service owners, meaning, if we can identify them, cloud account owners, Slack owners, and things like that. In the case of more generic secrets, like private keys or even passwords, we opted to notify the responsible repository owners, so they can review the findings and decide if they want to mitigate them.
B
There is even token scanning, which GitHub announced earlier today; we think that it can replace some parts of our scanner, so we can focus more on the specific things like our audit service. We actually noticed, when we started rolling out the audit service to our development teams, that we received quite a bit of feedback on what things were good and what things were bad, and so I want to give you a bit more insight into the things we found to be important considerations. The most important thing we noticed is that you need to try to have as few false positives as possible. Probably everybody has already had a security scanner that sent out like 300 messages of which only two were valid, and this ends up with development teams ignoring the messages. So of the highest importance to us was to ensure accuracy that is as high as possible.
B
What we also noticed is that a lot of credentials are in dependency folders, like vendor for Go or node_modules for Node.js, and are probably imported from other sources like github.com. So we opted to exclude them: if they are valid credentials, they should probably already have been mitigated at the source, and we didn't want to report them again. The second thing we noticed when sending out notifications to our development colleagues was that the first question we received was: yeah, and what should we do now?
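The dependency-folder exclusion can be sketched as a simple path filter. The folder names "vendor" (Go) and "node_modules" (Node.js) come from the talk; the rest are common additions assumed for illustration:

```python
from pathlib import PurePosixPath

# "vendor" and "node_modules" are mentioned in the talk; the other
# entries are assumed examples of vendored-dependency folders.
VENDORED_DIRS = {"vendor", "node_modules", "bower_components", ".yarn"}

def is_vendored(path: str) -> bool:
    """True if any component of the path is a known dependency folder."""
    return any(part in VENDORED_DIRS for part in PurePosixPath(path).parts)

def filter_findings(paths):
    """Drop findings located inside vendored dependency folders."""
    return [p for p in paths if not is_vendored(p)]
```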
B
Additionally, it is important to have somebody as a contact whom teams can reach out to in case they have any questions, or if they notice abuse of some accounts, so that this can be escalated and properly handled and doesn't end up in a void. And the last thing we implemented, mainly to track progress on our side, was an automatic revalidation of all the potential credentials we matched, every day.
B
So every day we have a list of credentials of which we know whether they have been mitigated or not. This also takes the burden off the responsible development colleagues of manually flagging findings as mitigated or not, and we don't bug teams about credentials which are no longer valid. That said, we can also make sure that, if credentials don't get rotated or mitigated in time, we can either take a look at them ourselves or escalate as necessary. And with that in mind, this concludes our presentation.
D
Now, that was the push to inner source, and it was really good to hear about how you do it in real life. We've got a session coming up (gosh, it's very early in the morning our time here in Europe): Danese Cooper is coming on later today. She's obviously a master of inner source, and she'll be talking about how inner source is going to be good for free and open source sustainability. But what was really fantastic was to kind of see it done in practice.
A
I'm gonna tell you, finding that Bitcoin key: I'm just bummed that y'all aren't crypto rich. You should have been crypto rich, standing up there crypto rich right now with everybody watching. So, how excited did y'all get when you found that? Oh, this is my own question, not from the audience. I would have been ecstatic.
B
Actually, when we found that, we first found rather many patterns which matched, so I was surprised how many there were. Then we implemented some further verification to see how many of these were actually valid matches or just false positives, and after starting that and applying it, in like the first hundred matches we saw no positive. I was already pumped, and then, at the very last point, we still found one, and then I also copy-pasted it over to check if there was anything in the wallet, and yeah.
B
Yeah, so currently we don't have concrete plans to open source it, mainly for the reason that we have a lot of very specific hacks in there which apply to our specific use case and our internal services. But if you are interested in some of the details, I think we can write up some of the more generic things. The core implementation of the scanning is rather straightforward, and I think we can see if we can provide something there, but I'll take a look at that in detail.
D
And then we've got another question here from K robots. They were asking: is it possible to have these scans run before users are able to make pull requests? So is there any way, you know, like a pre-commit hook, that sort of thing, to stop them getting into a branch? Or what do you do if they're in the pull request; you know, aren't they then in the history?
B
Yeah, so the first thing, and I already planned to include that in my presentation but actually missed it: scanning for credentials in repositories is only the second step. The first step is that you would like to prevent them getting there at all, so you don't need to trigger a mitigation process, because as soon as credentials are on GitHub, you should probably rotate the credentials and not only remove them from the history. That being said, you can for example use the gitleaks utility or, if I remember correctly, similar tools as a pre-commit hook for that.
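A pre-commit check of the kind the question asks about can be sketched as a script that scans the staged diff before the commit is created. This is a generic illustration, not the tooling SAP uses; the patterns and hook wiring are assumptions.

```python
import re
import subprocess
import sys

# Illustrative patterns; a real hook would load a maintained rule set.
SECRET_PATTERNS = [
    re.compile(r"\bAKIA[0-9A-Z]{16}\b"),                      # AWS access key ID
    re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),  # PEM key header
]

def find_secrets(diff_text):
    """Return secret-looking strings on added lines of a unified diff."""
    hits = []
    for line in diff_text.splitlines():
        # Added lines start with "+"; "+++" is the file header, not content.
        if line.startswith("+") and not line.startswith("+++"):
            for rx in SECRET_PATTERNS:
                hits.extend(rx.findall(line))
    return hits

def main():
    # Installed as .git/hooks/pre-commit, a non-zero exit blocks the commit.
    diff = subprocess.run(["git", "diff", "--cached"],
                          capture_output=True, text=True).stdout
    hits = find_secrets(diff)
    if hits:
        print("potential secrets staged, aborting commit:", hits)
        sys.exit(1)

if __name__ == "__main__":
    main()
```

Since only the staged diff is scanned, the check stays fast; the server-side scan then remains the safety net for anything that slips through.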
D
That's really interesting! So do you have any (this is a question from me now, not from the audience; sorry, I'll get back to you folks in a minute) tooling that you use for developers to kind of set up their local Git environments: getting their email addresses set properly, any pre-commit hooks, anything like that sort of thing set up properly in their local environment?
B
So we don't have a full script to run for that, as most people are already rather familiar with Git and how to set it up. But we have a very handy quick start guide which says which email address you should set and how to set it, so that development colleagues can quickly get ramped up on Git. The great thing about Git is that most people are actually already rather familiar with it and don't need a long onboarding trail, so the internal documentation we have around it is rather limited, and we rely quite a lot on the open knowledge which is already available publicly. That's also a great thing about Git: everything you can find online, and you don't have to have some paid tools around it. Yes.
D
Fantastic, yeah. So, I worked over at Microsoft a while ago, and I helped out when the Windows team were adopting Git, and one of the things that they had was some posters around, you know, the five stages of Git as they were coming up to speed, instead of the five stages of grief. So it's quite interesting sort of seeing the training that different people need as you migrate them, and it's brilliant that your community were just able to pick it up and run. So that's fantastic!
C
Really nice. I like both, but learning to actually program properly the way you do at work is usually far more advanced than what you do at university. Also, I think I probably learned more while working than at university, actually, because if you have a problem and then you're researching information for it, you directly know how to apply the new knowledge, so it just sticks better, I think.
A
You know, I just have such a sweet spot in my heart for students and learners, and, you know, no matter if you're going to university or taking an apprenticeship or an internship (you know I love our GitHub interns), getting real-world experience is so awesome and important: getting to do real, cool stuff in real life. I mean, for me, when I think about it, it's been like 20 years since I've actually, you know, done some studying at university.
C
Actually, the scanning part that we implemented is not GitHub-specific; it just uses plain calls to Git, so it can be run on anything. But some of the metadata extraction parts, which we use to find who the repository owner is and that kind of stuff, are GitHub Enterprise specific, though you could adapt them for reuse on github.com.