From YouTube: CVE-2018-1002105 post-mortem
Description
CVE-2018-1002105 post-mortem
See https://github.com/kubernetes/kubernetes/issues/71411 for details
A: This document that we're going to be reviewing has already been constructed and reviewed by the people who participated in the incident, and now we're going to review the document together. As you're reading, if you have questions, feel free to add them to the doc or call them out on the call, and we can discuss them. Then, after this call, we'll be publishing this document as a PDF linked to from the website somewhere. All right, so I'm going to just start reviewing this; feel free to interrupt with questions and we'll go from there.
So, just to summarize what the issue was: there was a vulnerability that allowed TCP connection reuse for connections that were proxied through the API server to back-end servers. The two escalation paths that were known were through the API server to aggregated APIs, and through the API server to kubelets. In particular, calls through the API server to aggregated API servers could be exercised via discovery APIs that were, by default, permitted to anonymous users.
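To make the class of bug concrete, here is a minimal sketch in Go (Kubernetes's own language) of the flawed upgrade-proxy pattern. It is an illustration of the mechanism only, not the actual API server code; the handler name, routes, and addresses are made up. The handler forwards a client's upgrade request to a backend and then splices the two connections together without ever checking whether the backend accepted the upgrade, so a rejected upgrade still leaves the caller holding a raw tunnel that rides on the proxy's own credentials.

```go
// Minimal sketch (NOT the actual Kubernetes code) of the flawed
// upgrade-proxy pattern behind CVE-2018-1002105. Names and addresses
// are made up for illustration.
package main

import (
	"fmt"
	"io"
	"net"
	"net/http"
)

// tunnelHandler forwards an upgrade request to the backend and then blindly
// splices the client and backend connections together.
func tunnelHandler(backendAddr string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		backend, err := net.Dial("tcp", backendAddr)
		if err != nil {
			http.Error(w, "backend unreachable", http.StatusBadGateway)
			return
		}
		defer backend.Close()

		// Forward the client's request to the backend in wire format.
		if err := r.Write(backend); err != nil {
			http.Error(w, "forwarding failed", http.StatusBadGateway)
			return
		}

		// Take over the client connection so raw bytes can be relayed.
		hj, ok := w.(http.Hijacker)
		if !ok {
			http.Error(w, "hijacking unsupported", http.StatusInternalServerError)
			return
		}
		client, _, err := hj.Hijack()
		if err != nil {
			return
		}
		defer client.Close()

		// FLAW (the heart of the CVE): nothing checks whether the backend
		// actually answered "101 Switching Protocols". If it returned an
		// error instead, the tunnel is still established, and every later
		// byte the client sends travels to the backend over the proxy's
		// already-authenticated connection.
		go func() { _, _ = io.Copy(backend, client) }()
		_, _ = io.Copy(client, backend)
	}
}

func main() {
	// Hypothetical listener and backend address, for illustration only.
	http.HandleFunc("/proxy", tunnelHandler("127.0.0.1:10250"))
	fmt.Println(http.ListenAndServe("127.0.0.1:8001", nil))
}
```

Roughly speaking, the upstream fix added the missing step: verify the backend's response status before treating the connection as an opaque tunnel, and close it otherwise.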
A
There's
a
link
to
the
github
issue
that
really
dives
into
a
lot
of
the
details
about
which
particular
versions
were
affected,
which
particular
configurations
so,
if
you're
interested
feel
free
to
follow
that
link
and
find
out
more.
But
hopefully
most
people
are
familiar
with
that
by
now
jumping
down
to
the
timeline
for
the
incident,
the
bug
was
first
reported
publicly
against
the
rancher
component.
A
Actually,
they
made
use
of
the
API
server,
and
some
users
noticed
that,
when
load
balancers
would
speak
to
the
API
server,
this
TCP
connection,
reuse
vulnerability
would
actually
manifest
as
a
bug
we're
connected,
we
get
stuck
open
and
the
load
balancer
would
reuse
it
and
then
get
error,
error
responses
that
didn't
make
any
sense.
So
it
sat
as
a
bug
report
for
a
few
months,
while
they
were
trying
to
figure
out
what
was
going
on
with
it.
A
You,
the
following
day
a
second
proof-of-concept
exploit
for
the
anonymous
upgrade
attack,
was
produced
and
so
that
lengthened
the
review
process
to
make
sure
that
all
of
the
all
of
the
ways
that
this
could
be
exploited
were
understood
and
that
we
knew
what
the
severity
was
before
we
moved
forward.
The
following
day,
the
fix
was
the
review
was
completed
and
CI
tests
in
the
Private
Security
repository
were
green,
with
the
fix
so
in
to
end
integration
and
unit
tests
were
passing.
The investigation revealed that the particular proxy component that had the vulnerability had existed since Kubernetes 1.0. It was originally used for pod exec and port-forward, and was later used for additional proxy use cases, specifically aggregation when that was added in the 1.6/1.7 time frame. The test cases for those components were primarily focused on making sure the functions they were used for worked correctly, and so the tests were primarily positive tests, not tests of error conditions or negative tests.
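As an illustration of the kind of negative test that was missing, here is a sketch that asserts a rejected upgrade must not leave the proxied connection open for reuse. It assumes it lives in the same package as the hypothetical tunnelHandler sketched above; run against that flawed handler it fails, which is exactly the signal such a test would have provided.

```go
// Negative-test sketch: a failed upgrade must tear the tunnel down.
// Assumes the flawed tunnelHandler from the earlier sketch is in this package.
package main

import (
	"bufio"
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"testing"
	"time"
)

func TestFailedUpgradeDoesNotLeaveTunnelOpen(t *testing.T) {
	// Backend that refuses every upgrade attempt with an error status.
	backend := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.Error(w, "upgrade refused", http.StatusForbidden)
	}))
	defer backend.Close()

	// Proxy under test.
	proxy := httptest.NewServer(tunnelHandler(backend.Listener.Addr().String()))
	defer proxy.Close()

	conn, err := net.Dial("tcp", proxy.Listener.Addr().String())
	if err != nil {
		t.Fatal(err)
	}
	defer conn.Close()
	conn.SetDeadline(time.Now().Add(5 * time.Second))

	// Attempt an upgrade; the backend will reject it.
	fmt.Fprint(conn, "GET /proxy HTTP/1.1\r\nHost: example\r\nConnection: Upgrade\r\nUpgrade: websocket\r\n\r\n")
	br := bufio.NewReader(conn)
	resp, err := http.ReadResponse(br, nil)
	if err != nil {
		t.Fatal(err)
	}
	io.Copy(io.Discard, resp.Body)
	resp.Body.Close()
	if resp.StatusCode == http.StatusSwitchingProtocols {
		t.Fatal("upgrade unexpectedly succeeded")
	}

	// The tunnel should be gone: a follow-up request must NOT reach the backend.
	fmt.Fprint(conn, "GET /second HTTP/1.1\r\nHost: example\r\n\r\n")
	if _, err := http.ReadResponse(br, nil); err == nil {
		t.Fatal("connection was reused after a failed upgrade")
	}
}
```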
And finally, the method used to transfer authentication information from the API server to backends is secured via a mutual TLS connection, and both high- and low-privileged API requests are proxied in the same way. That allowed a low-privilege request, like a request via the discovery API, once it dropped down to a TCP connection, to be reused to make high-privileged requests, like mutating API requests.
Most of the lessons learned were around the actual review and release process, so I wanted to jump down to some of the action items that came out of the technical root causes before we go through the existing feedback. These are some of the issues that have been opened, either to investigate changes or to lock down existing behavior.
A
Another
item
is
to
lock
down
the
upgrade
protocols
that
we
allow
to
the
ones
that
we
expect
so
specifically
use,
maybe
in
WebSockets
a
lot
of
these
are
kind
of
been
done
in
parallel.
If
we
switch
these
HTTP
two
that
would
remove
the
need
for
locking
down
the
upgrade
protocols,
but
something
like
HTTP
2
switch
is
likely
to
take
longer
because
of
deprecation
policies
and
maintaining
api
compatibility,
and
so
we're
looking
for
ways
to
reduce
our
surface
area.
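A rough sketch of what locking the upgrade protocols down to an expected set could look like. The helper names are assumptions, not the real apiserver code; the allowlist contents reflect the protocols Kubernetes actually uses for exec/attach/port-forward (SPDY/3.1) plus WebSocket clients.

```go
// Illustrative sketch of an upgrade-protocol allowlist in front of a proxy.
package main

import (
	"net/http"
	"strings"
)

var allowedUpgradeProtocols = map[string]bool{
	"spdy/3.1":  true,
	"websocket": true,
}

// isAllowedUpgrade reports whether the request asks for an upgrade we expect.
func isAllowedUpgrade(r *http.Request) bool {
	if !strings.Contains(strings.ToLower(r.Header.Get("Connection")), "upgrade") {
		return false
	}
	for _, proto := range strings.Split(r.Header.Get("Upgrade"), ",") {
		if !allowedUpgradeProtocols[strings.ToLower(strings.TrimSpace(proto))] {
			return false
		}
	}
	return true
}

// guardedProxy rejects unexpected upgrade protocols before proxying.
func guardedProxy(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("Upgrade") != "" && !isAllowedUpgrade(r) {
			http.Error(w, "upgrade protocol not allowed", http.StatusForbidden)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	backend := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		_, _ = w.Write([]byte("proxied\n"))
	})
	_ = http.ListenAndServe("127.0.0.1:8002", guardedProxy(backend))
}
```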
In addition to specifically testing error conditions, fuzz testing could possibly have helped catch some of these things. Like I mentioned, this was originally reported as a bug that was encountered in the wild, so it was possible to trigger with accidental requests, and fuzz testing that triggered an error condition in the back end could possibly have detected that it wasn't being handled correctly.
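A sketch of what such a fuzz test could look like using Go's built-in fuzzing, targeting the hypothetical isAllowedUpgrade helper from the previous sketch (same package, in a _test.go file). The invariants checked here, never panic and never admit a protocol outside the allowlist, are an illustrative reading of what "handled correctly" means, not the project's actual test suite.

```go
// Fuzz sketch (go test -fuzz=FuzzIsAllowedUpgrade). Assumes it sits in a
// _test.go file in the same package as the hypothetical isAllowedUpgrade helper.
package main

import (
	"net/http/httptest"
	"strings"
	"testing"
)

func FuzzIsAllowedUpgrade(f *testing.F) {
	// Seed corpus: a normal upgrade, odd casing, and a disallowed protocol.
	f.Add("Upgrade", "websocket")
	f.Add("upgrade", "SPDY/3.1")
	f.Add("keep-alive", "h2c")

	f.Fuzz(func(t *testing.T, connection, upgrade string) {
		r := httptest.NewRequest("GET", "/proxy", nil)
		r.Header.Set("Connection", connection)
		r.Header.Set("Upgrade", upgrade)

		// Invariants: never panic, never admit a protocol outside the allowlist.
		if isAllowedUpgrade(r) {
			for _, p := range strings.Split(upgrade, ",") {
				p = strings.ToLower(strings.TrimSpace(p))
				if p != "spdy/3.1" && p != "websocket" {
					t.Fatalf("unexpected protocol allowed: %q", p)
				}
			}
		}
	})
}
```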
A
In
the
interest
of
the
attack
surface
available
to
unauthenticated
requests,
there's
an
issue
open
to
investigating
narrowing
which
endpoints
are
allowed
to
anonymous
users.
Currently,
unauthenticated
requests
are
allowed
to
obtain
health
information,
which
is
basically
an
up
or
down
signal
at
the
API.
Server
is
healthy
version,
information
and
discovery.
Information,
health
and
version
are
likely
to
continue
to
be
allowed
because
those
are
frequently
used
by
load
balancers
that
don't
have
the
ability
to
authenticate,
but
those
are
much
better
understood.
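For operators who want to see that surface in their own clusters, here is a small client-go sketch, not part of any official tooling, that lists the ClusterRoleBindings granting permissions to the system:unauthenticated group, which is where discovery access for anonymous users comes from. It assumes a recent client-go and a kubeconfig at the default location.

```go
// List ClusterRoleBindings that grant anything to system:unauthenticated,
// that is, the surface reachable by anonymous requests.
package main

import (
	"context"
	"fmt"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", filepath.Join(homedir.HomeDir(), ".kube", "config"))
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	bindings, err := clientset.RbacV1().ClusterRoleBindings().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, b := range bindings.Items {
		for _, s := range b.Subjects {
			if s.Kind == "Group" && s.Name == "system:unauthenticated" {
				fmt.Printf("%s -> ClusterRole %s\n", b.Name, b.RoleRef.Name)
			}
		}
	}
}
```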
All right, that is the end of the technical action items. The remaining ones are around communication and the release process itself, so I'm going to jump back up to the lessons-learned section. I do want to pause here, in case there are questions about any of those action items or other points that people wanted to raise on the technical side before we continue.
The private review and testing went smoothly. That's good, and it hasn't always been the case. On the timeline that was provided to distributors to test and release the fix within their distributions, the feedback was that the timeline was reasonable. It was particularly tricky just given the time of year: holiday freezes and Black Friday and things like that were in play, so we were trying to navigate those in addition to getting the fix out as quickly as we could and making sure it made it into 1.13 as well.
A
All
right
so
some
things
that
we
got
feedback
on
that
could
have
gone
better.
Not
all
of
this
was
publicly
visible,
but
the
initial
announcement
to
distributors
had
a
cbss
score
that
was
high.
It
was
still
critical,
but
it
didn't
consider
the
ability
to
impact
workloads
availability
via
this
vulnerability,
and
so
we
had
a
follow-up
that
increased
that
score
to
the
one
that
is
in
the
public
issue.
Now
we're
still
trying
to
figure
out
how
to
map
some
of
the
metrics
for
these
vulnerabilities
to
the
cbss
score,
given
that
different
configurations
might
be
differently
affected.
A
The
public
releases
were
still
cut
publicly
so
on
the
26th.
The
fix
that
had
been
tested
and
reviewed
and
given
to
distributors
was
opened
into
the
public
repo,
and
then
there
was
a
time
delay
of
several
hours.
While
that
merged
and
see
I
ran
and
build,
artifacts
were
produced
and
rpm
and
Debian
packages
were
built
and
pushed.
We
would
still
like
to
shorten
the
window
from
when
a
fix
is
publicly
visible
to
when
they
really
started
backs
are
available.
A
This
was
a
minor
issue
but
see
I
in
the
private
repo
continually
atrophies
and
because
we
aren't
continuously
cutting
security
releases.
Thank
goodness,
it
means
that
when
we
do
jump
into
the
private
repo
to
do
this,
it
was
usually
ci
fixes
that
need
to
be
made.
So
that
delayed
us
about
a
day.
If
you
look
at
the
timeline,
the
review
completed
on
the
6th
and
it
took
us
until
the
7th
to
get
green
CI
I
was
trying
to
get
that
working
again.
A
While
we
were
cutting
the
releases,
the
patch
managers
for
the
three
releases
are
in
drastically
different
time
zones,
and
so
there
was
some
coordination
difficulty
around
making
sure
that
we
weren't
making
him
stay
up
until
midnight
just
to
cut
a
release.
And
so
there
was
some
handoff
issues
there
that
we
worked
through,
but
required
a
little
bit
of
scrambling
at
the
last
minute.
A
This
was
more
public-facing.
If
you
look
at
the
public
issue,
seven
one
four
one
one
and
look
at
the
number
of
times
the
the
description
of
that
was
edited.
The
initial
announcement
and
text
tried
to
lay
out
which
versions
and
which
configurations
were
impacted,
but
a
lot
of
questions,
kind
of
revealed
that
there
was
confusion
about
that,
and
so
part
of
this
was
because
there
were
two
escalation
pads
with
two
different
permission:
levels
that
affected
two
different
version
ranges.
So
the
cube,
cubelet
pod,
exec
attached
path
affected
all
the
way
back
to
one.
A
There
are
also
questions
about
which
versions
were
vulnerable
for
users
that
are
using
commercial
distributions,
so
not
consuming
the
open
source
release
artifacts,
but
consuming
something
from
a
commercial
vendor,
and
so
some
of
the
follow-up
items
for
that
were
to
be
sure
that
we
include
very
concrete
steps
for
determining
if
a
cluster
is
vulnerable,
not
necessarily
proof
of
concept
exploit.
But
if
you
look
at
the
what
we
ended
up
with
in
the
description,
but
we
ended
up
with
was
a
lot
more
specific
than
the
original
text.
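As a sketch of what such concrete steps might look like for the open-source artifacts, the snippet below asks the server for its version via client-go and compares it against the first patched releases listed in the public issue (v1.10.11, v1.11.5, v1.12.3; v1.13.0 shipped with the fix). This is illustrative only: as the discussion notes, commercial distributions carry their own version strings and backports, and real guidance also has to account for configuration (aggregated APIs, anonymous access), not just the version number.

```go
// Illustrative only: compare the server version against the first patched
// open-source releases from issue #71411. Assumes a kubeconfig at the default
// location and a recent client-go; not official tooling.
package main

import (
	"fmt"
	"path/filepath"

	"golang.org/x/mod/semver"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	// First open-source releases containing the fix.
	fixed := map[string]string{"v1.10": "v1.10.11", "v1.11": "v1.11.5", "v1.12": "v1.12.3"}

	config, err := clientcmd.BuildConfigFromFlags("", filepath.Join(homedir.HomeDir(), ".kube", "config"))
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	v, err := clientset.Discovery().ServerVersion()
	if err != nil {
		panic(err)
	}

	mm := semver.MajorMinor(v.GitVersion)
	switch {
	case semver.Compare(mm, "v1.13") >= 0:
		fmt.Printf("%s: includes the fix\n", v.GitVersion)
	case fixed[mm] == "":
		fmt.Printf("%s: unpatched branch, assume vulnerable\n", v.GitVersion)
	case semver.Compare(v.GitVersion, fixed[mm]) >= 0:
		fmt.Printf("%s: includes the fix\n", v.GitVersion)
	default:
		fmt.Printf("%s: vulnerable, upgrade to %s or later\n", v.GitVersion, fixed[mm])
	}
}
```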
A
Additionally,
the
rancher
developers,
once
they
figured
out
what
the
issue
was,
had
fixed
the
issue
in
a
patch
again
without
reference
to
the
security
implications,
and
so
we
found
ourselves
in
kind
of
the
odd
position
of
pre,
disclosing
the
security
issue
to
distributors
and
giving
them
a
patch
that
was
essentially
already
available
publicly
elsewhere.
And
so
we
couldn't
in
good
conscience,
say
that
the
patch
was
embargoed,
but
the
security
implications
of
the
patch
were
embargoed,
and
so
we
got
feedback
that
that
was
difficult
to
understand
and
some
very
security-conscious,
distributors.
A
Probably
you
went
overboard
in
how
cautious
they
were
to
the
point
of
not
even
giving
their
team's
the
patch
that
we
had
sent
them
because
they
didn't
want
to
break
embargo,
and
so
we
we
definitely
heard
the
feedback
that
the
communication
with
distributors
needs
to
be
very
explicit,
very
clear.
You
may
do
this,
you
may
not
do
this,
you
may
produce
builds,
you
may
not
produce
builds.
You
may
install
those
builds
in
hosted
environments.
You
may
ship
builds
to
end-users,
whatever
is
or
is
not
allowed
under.
A
Another
point
of
feedback
was
that
coordination
between
distributors
was
difficult
again.
People
were
being
very
conscientious,
which
is
greatly
appreciated,
but
in
cases
where
there
were
two
distributors
that
were
both
on
the
pre
disclosure
list
and
also
have
shared
customers
or
installations,
it
wasn't
clear
how
they
were
supposed
to
coordinate
on
the
issue
while
under
embargo.
And
so
one
of
our
action
items
is
to
make
sure
that
there
is
a
communication
channel
for
distributors
on
the
pre
disclosure
list
to
to
coordinate
prior
to
embargo.
Lifting.
A
This
last
last
bit
is
probably
not
interesting
to
anyone
except
the
product
security
team,
but
again
things
atrophy
and
since
the
last
time
we
had
a
security
release,
some
of
the
channels
that
we
announced
things
on
had
changed.
So
the
kubernetes
users
group
got
archived
and
the
discuss
forum
got
created
and
we
didn't
have
posting
permission
to
the
right,
the
right
places
in
the
forum
or
the
announcement
slack
channel,
and
so
it's
kind
of
inside
baseball,
but
making
sure
that
the
people
who
are
supposed
to
send
out
announcements
have
permission
to
do
so.
A
And
if
we
jump
down
to
the
follow-up
items,
some
of
the
things
around
process
that
were
wanting
to
take
into
consideration
having
multiple
patch
managers
per
release,
which
is
something
the
release
team
sig
release,
had
already
plans
to
do
for
114,
which
is
great
having
having
backups
for
that
is
great
spreading.
Those
across
time
zones
would
be
even
better
if
possible,
and
so
that's
being
added
to
the
things
to
consider
when
picking
patch
release
managers,
some
follow-up
items
to
simplify
and
clarify
how
the
distributors
announce
list
is
maintained.
B: What about software that uses Kubernetes and happens to be, you know, visible in some way? I mean, could there be some faster channel? I'm thinking that if it was not specifically for security, but for bugs in general, with some security people looking on, there might be kind of a faster channel, if that makes sense.
A
That's
an
interesting
question:
I
I!
Don't
so!
If
you
look
at
our
our
dependencies,
we
actually
do
monitor
our
dependencies
for
security
vulnerabilities
but
kind
of
going
the
the
other
direction
and
saying
oh,
this
thing
is
built
on
top
of
kubernetes
and
encountered
a
bug
that
maybe
look
suspicious
I
I.
Don't
think,
there's
any
monitoring
of
that
the
volume
that
that
would
involve,
probably
isn't
tenable.
Given
the
current
number
of
people
who
are
watching
these
issues
of.
B: Of course. I'm just thinking about the parallel: if Kubernetes and all of the platforms that use it were part of one company, this type of thing would be flagged by support, right? Support would go, oh, somebody reported a bug to me; actually, maybe this belongs in a different component. And those channels end up discovering security things quicker.
A: One of the other dimensions is across the whole Kubernetes project, not just in SIG Auth or this product security team. We're looking at the process that changes to Kubernetes go through, the KEP process, the Kubernetes Enhancement Proposal, and looking at adding specific questions and an evaluation of the security impact of changes, trying to make sure that this is something everybody in the project is considering as they're making changes or working on issues. So that's another dimension we're trying to improve.