From YouTube: SIG Instrumentation 20200611
Description
SIG Instrumentation Meeting June 11th, 2020
A
Hey everyone, today is June 11th. Welcome to the SIG Instrumentation bi-weekly meeting. It looks like we have four items on the agenda. Frederick, do you want to take the lead on this?
B
Yeah, so I think this is something that has actually been in our backlog for quite a long time and we just hadn't had a chance to discuss it. The summary of the issue is that, currently, the paths of logs on disk only contain the namespace and pod name, plus the pod UID, but not the namespace UID.
B
If I remember correctly, the problem here was that, just like with pod names (which is why the pod UID is in the path), it's potentially possible to have clashes with namespace names as well. It's unlikely, but it can happen, and the proposal here is simply to add the namespace UID as well.
B
If we change this, my understanding is that we cannot break the existing paths; we would have to make the change additive and then potentially deprecate the current paths in the future. But I think David would be better placed to answer that.
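For context, here is a rough sketch of the on-disk layout being discussed. Both the "current" format and the additive variant are written from memory and should be treated as illustrative assumptions, not as the agreed proposal.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// Roughly the per-pod log directory kubelet uses today: namespace, pod name
// and pod UID are encoded in the path, but the namespace UID is not.
func currentPodLogDir(namespace, podName, podUID string) string {
	return filepath.Join("/var/log/pods",
		fmt.Sprintf("%s_%s_%s", namespace, podName, podUID))
}

// Hypothetical additive variant: a second, namespace-UID-qualified directory
// that could exist alongside the current one until the old layout is deprecated.
func proposedPodLogDir(namespace, namespaceUID, podName, podUID string) string {
	return filepath.Join("/var/log/pods",
		fmt.Sprintf("%s_%s_%s_%s", namespace, namespaceUID, podName, podUID))
}

func main() {
	fmt.Println(currentPodLogDir("default", "nginx", "pod-uid-1234"))
	fmt.Println(proposedPodLogDir("default", "ns-uid-5678", "nginx", "pod-uid-1234"))
}
```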
A
Okay, the next thing is the API call latency SLO. We were actually just talking about something related to this. Currently, SIG Scalability defines a bunch of SLOs on the API server, and...
A
Unfortunately, we can't actually guarantee any of these things, because it is possible for people to configure arbitrary webhooks. What would allow us to actually make guaranteeable claims is, for instance, having something like an internal API server request processing time.
B
Well, it's used for the scalability tests, yeah. So to me it sounds like, if you have webhooks that may interfere with this metric, then you have to adjust that SLO accordingly when you offer Kubernetes as a service or whatever; you're still actually interested in the entire time it takes for you to serve your users, right?
A
Yes, but if you look at the top level of that directory, you will see an slos.md, and these are not gating releases; these are user-facing SLOs.
B
Exactly, and I think that's kind of the problem and that's something we need to fix, but I think segregating the metrics is a possible solution. I personally don't think it's the solution we should go with. I think we should just change the wording there and say: this is our guarantee if you do not have webhooks; you need to think about what overhead a webhook introduces when you guarantee something to your users.
A
I mean, currently it's impossible to know how long an internal request takes in the API server, right? We have webhook latency metrics and we have request metrics, but they're not really joinable, right? You can't really...
B
But if webhooks are configured, users are still going to experience what they experience, right? And that is still correct to alert on. But, like, it is...
B
I do agree with you that we need a mechanism to say: wait a minute, yes, the request did take this long, but it was because of the webhooks and not because of something in the API server. That's what we're trying to answer, right? Typically the way I've seen this done is with histograms for the outgoing requests that go to the webhooks, and that is totally something you could use to inhibit API server alerts, for example, if that particular alert is firing. Does that make sense? I don't know if other monitoring systems have this kind of capability, but in Prometheus, for example, you would say: okay, there's an alert for webhook responses taking long; if that's firing, do not actually notify me about API server latency being high, because I know it's going to be high because of the webhooks.
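The inhibition being described could look roughly like the following Alertmanager rule. The alert names (WebhookLatencyHigh, APIServerLatencyHigh) and the label used for matching are assumptions for illustration, not alerts from any agreed-upon rule set.

```yaml
# Hypothetical Alertmanager inhibit rule: while the webhook-latency alert is
# firing for a cluster, suppress the API server latency alert for that cluster.
inhibit_rules:
  - source_match:
      alertname: WebhookLatencyHigh
    target_match:
      alertname: APIServerLatencyHigh
    equal: ['cluster']
```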
A
Because, yeah, they add up; they're killers of latency, right? You can literally have hundreds of webhooks, so you can actually be in a situation where no individual webhook is alerting, because the latency is actually quite reasonable for that webhook, but the aggregate total of your webhook processing time is extraordinarily high.
B
Like, say you have an aggregated API; we also track the entire latency there.
B
Yeah, it's entirely independent, yeah, a different API server, yeah. So you could argue that that's equally as much of a problem as webhooks, which I guess I agree with.
A
But we have "component" in our request metrics, and that should tell you whether or not the endpoint is aggregated, so it's actually not the same, because we can actually distinguish the two right now. Okay.
A
I mean, we can do the internal thing, but okay: if there's an alternative to doing internal processing time, I'm receptive to it. I'm just not sure, or at least I'm not understanding, what alternative you're suggesting.
B
I mean, as I said, I actually don't think there's anything wrong with this. I just think that we cannot promise it in the way that it is stated there right now. We're coming from the same direction, right? But I'm saying: because people can modify, can customize, their Kubernetes in various ways with webhooks, that does make it a custom experience, and they need to figure out what the SLOs for that are. For a standard Kubernetes without any webhooks, this is what we can guarantee, and everything else you're going to need to figure out yourselves.
B
I mean, okay, so you brought up the example that you wouldn't know which webhook is the origin of the problem, right? Potentially.
B
Yes. So let's say we did have the metric that captures the latency without webhooks; that's kind of what I understood you were proposing with internal processing time.
B
What would the next step be for actually investigating this?
B
You would still need to figure out which webhook, or which combination of webhooks, is actually causing this, right? Ultimately you're still trying to resolve the problem, no?
B
Yeah, I mean, if I'm understanding you correctly, we kind of run into the same problem as we do with CRDs or something, where we can essentially now have an arbitrary amount of metrics added, because it's dynamic.
A
Yeah, you know that it's in some combination of webhooks; you know that the total, cumulative webhook processing time for this specific endpoint was greater than what you expected. You don't know...
B
I'm still not totally convinced that that is necessarily the better thing to do, rather than a per-webhook histogram, or even a summary.
C
I think each component should have its own SLOs, or have its own signals, as to whether that component is performing correctly. So, in my mind, yes, you should add metrics to your webhooks so that you know how they're performing, but I do think it's also quite useful to have a signal just to see if the API server is doing its job correctly.
A
Yeah, I mean, it's kind of hard to tell right now, is the thing. It's actually basically impossible to tell whether it is a conjunction of webhooks or whether it's internal to the API server, and it takes a lot of digging and grepping through logs to figure out what something looked like during one request path. I mean... yeah.
A
Definitely, yeah. I just wanted to put it out there; it's not something I have an answer for, I was just curious what people thought about it. I will continue to think about it, and if anybody has additional thoughts on it...
B
I definitely think the intention is very good. I think there are a couple of solutions that we could go with.
A
We still have another thing on the agenda, so let's skip the run through the PR backlog this week. And Alexander, do you want to...
D
For some reason I can't share my screen. It says the host has disabled it.
D
Or I can just forward it... well, okay.
D
Yes, okay, all right, thank you. So, thank you so much for giving us an opportunity to come and talk to you about static analysis and dynamic analysis.
D
My name is Alex Czernikovsky, and here with me today I have Patrick. We're both part of the GTE security team, and we've been working on trying to solve the problem of reducing the possibility of leaking potentially sensitive information, like credentials, to logs. In the next 15 minutes...
D
So here's our agenda. As I mentioned, the objective today is to address some very good questions that were raised as part of the KEP. One of the questions was, concretely: do we need to actually extend klog, which is a core component of Kubernetes that a lot of other components in the ecosystem depend on?
D
Why can we not just use static analysis, which does the scanning completely outside of runtime, and get similar results? The TL;DR here is that we believe we should do both. As a matter of fact, in this KEP we decided to purposely focus only on the dynamic analysis, just the klog extension, and our plan is to follow up with another KEP that will propose adding a pre-submit check to Kubernetes...
D
...that does the static analysis. But, as you will see, and hopefully we will be able to explain this, for the best results we should focus on both types of analysis, dynamic and static.
D
So, excuse me, I will explain how we do static analysis. We'll also do a short demo of the tool that Patrick and I have been working on, which we've actually been running against Kubernetes, and we've actually found some leaks already.
D
So, as I mentioned, static analysis is something that we consciously decided not to include in this KEP, just to make sure that we have an actionable outcome from it: we want to add the possibility to inject additional processing into klog, and this processing will take care of detecting potential leaks. It is our intention to follow up with another KEP for the static analysis.
D
So, taint propagation analysis is based on the idea of identifying sensitive pieces of information as they enter the program, or a function, and tracing their potential flows through the program. In particular, we're interested in the events where this potentially sensitive data exits the boundary of the program. Such a boundary could be a log, for example, or it could be a database, or an RPC call, or an outbound HTTP call. In the parlance of taint propagation analysis, the sensitive piece of data coming in is called a source, and the moment when this data leaves the boundary of the program is called a sink. The idea of taint propagation analysis is very simple: we build the graph of the program, we first find the inputs, which become the roots of the graph, and we trace all the execution paths until we find a sink. We also have the concept of a sanitizer.
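To make the source, sink and sanitizer vocabulary concrete, here is a minimal sketch in Go. The names used (the redact helper, the klog calls, a Secret as the sensitive type) are illustrative assumptions, not code from the analyzer or from Kubernetes.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/klog/v2"
)

// redact stands in for a sanitizer: whatever it returns is considered safe,
// so a flow of source -> sanitizer -> sink raises no finding.
func redact(s *v1.Secret) string {
	return fmt.Sprintf("secret %s/%s (%d keys)", s.Namespace, s.Name, len(s.Data))
}

func handle(s *v1.Secret) {
	// s is a source: a value that may hold credentials.

	// A sink (the log call) reached directly by the source:
	// this is the kind of flow the analyzer would flag.
	klog.Infof("got secret: %v", s)

	// The same sink reached only through the sanitizer: a normal flow.
	klog.Infof("got %s", redact(s))
}

func main() {
	handle(&v1.Secret{
		ObjectMeta: metav1.ObjectMeta{Namespace: "default", Name: "db-creds"},
		Data:       map[string][]byte{"password": []byte("hunter2")},
	})
}
```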
D
Basically, if the data passes through a sanitizer on its way to a sink, that is considered a normal code flow and we do not raise any concerns about it. So, as you're probably thinking about this, there are two parts to it. One part is the analyzer itself, the graph processing and so on, but you're probably already thinking: yes, this requires a lot of domain knowledge. For example, I could be the best static analysis developer in the world, but unless I knew which types in Kubernetes may contain sensitive information, my analyzer would not find anything.
D
In
other
words,
there
is
certainly
significant
input
that
is
required
from
the
community
to
describe
the
inputs
that
may
contain
potentially
sensitive
information,
and
the
same
goes
for
the
sanitizers,
and
the
same
goes
for
the
log
which,
which
pretty
much
kind
of
explains
the
main
architectural
characteristic
of
the
of
the
analyzer,
that
it
is
very
config
driven.
So
it
receives
a
config
that
defines
all
the
sources
all
the
sanitaries
on
the
logs
and
then
when
it
analyzes
the
graph,
it
makes
the
recommendations
about
potentially
risky
risk
operations.
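Purely as an illustration of what a config-driven setup of this kind could look like, here is a sketch; the field names and the match-by-qualified-name scheme are made up for this example and are not the analyzer's actual configuration format.

```go
package main

import "fmt"

// Hypothetical analyzer configuration: the domain knowledge (which types are
// sensitive, which functions sanitize, which calls are sinks) lives in data,
// so the community can extend it without touching the analyzer itself.
type Config struct {
	Sources    []string // type names whose values are considered sensitive
	Sanitizers []string // functions whose results are considered safe
	Sinks      []string // functions that tainted data must not reach
}

func main() {
	cfg := Config{
		Sources:    []string{"k8s.io/api/core/v1.Secret"},
		Sanitizers: []string{"redact"},
		Sinks:      []string{"k8s.io/klog/v2.Infof", "k8s.io/klog/v2.Errorf"},
	}
	fmt.Printf("%+v\n", cfg)
}
```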
D
So
there
is
another
aspect
of
of
the
there's
another
interesting
complexity
about
this
about.
This
analysis
is
something
we
call
propagators
so
like
pretty
much
the
same
picture
here
you
see
the
input,
but
then
we
took
the
input
and
we
converted
to
the
string.
So
basically,
let's
say
imagine
that
the
input
is
the
kubernetes
secret
and
we
call
the
string
method
on
it
and
now
the
variable
s
is
a
string
that
contains
potentially
contents
contains
credentials.
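A minimal sketch of that propagation case, assuming for illustration that the Secret is stringified with fmt.Sprintf before it reaches the log call; the point is that the analyzer has to carry the taint through the conversion.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/klog/v2"
)

func logRequest(secret *v1.Secret) {
	// fmt.Sprintf acts as a propagator: the taint on `secret` (a source)
	// flows into the new string value s.
	s := fmt.Sprintf("%v", secret)

	// The sink no longer sees the Secret type directly, only a plain string,
	// yet it may still contain credentials; the analyzer must follow the
	// taint through the conversion to flag this.
	klog.Info(s)
}

func main() {
	logRequest(&v1.Secret{Data: map[string][]byte{"token": []byte("s3cr3t")}})
}
```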
D
So I think this is pretty much all the theory we need to understand how the static analysis works. With that, I will pass it over to Patrick, and he will do a short demo of the tool we've been working on.
E
I would also like to share my screen.
E
Here we go. So, share screen... here we go. All right, cool. So, yeah, pretty much exactly like Alex was saying, we did end up finding one live, in-the-wild example. Here we are in the kubelet token manager, and I primarily think of these static analysis tools as things that help carry developer mental load. Here we have just a token request refresh, and there's a very simple case: if the requested expiry is nil, that's an invalid request, so we're just going to throw it into a log, saying hey, this request wasn't valid, figure out what's going wrong. But in this particular case, the token request is defined over here in the top right as containing this status, which is optional, and the request itself might contain the token. So this is one of those instances where, yeah, this is maybe something where we should have explicitly removed the token; we don't know if it's there or not.
E
This may never have actually become an issue, but back at the end of April we ended up rewriting this to explicitly zero out that token before it was sent to the logs. This is the sort of thing static analysis is great at catching. So here we can use our vet tool and see that it's telling us that right here we've got this issue. And the big one, like Alex was saying, is catching propagation.
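Roughly the shape of the fix being described, as a paraphrased sketch with hypothetical names rather than the actual kubelet change:

```go
package example

import (
	authenticationv1 "k8s.io/api/authentication/v1"
	"k8s.io/klog/v2"
)

// Before (sketch): logging the whole TokenRequest may leak Status.Token
// if it happens to be populated.
//
//	klog.Errorf("expiration seconds was nil for token request: %v", tr)
//
// After (sketch): blank out the sensitive field on a copy before logging.
func logInvalidTokenRequest(tr *authenticationv1.TokenRequest) {
	redacted := tr.DeepCopy()
	redacted.Status.Token = "" // never let the token itself reach the log
	klog.Errorf("expiration seconds was nil for token request: %v", redacted)
}
```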
E
So here, if we were to specifically cast this thing to a string first and log that, you know, we want to make sure that we're still catching things like that as well. So we do have... oh, sorry, go for it.
E
Oh, okay, well, the two-minute wrap-up is that the interaction with the existing KEP would be this: the dynamic portion of the analysis inverts the whole thing. Instead of trying to find where an input can possibly reach, we look, at the point of logging, at what input is coming in, and we are able to integrate our sanitizers into that to identify the data coming in, probably using reflection. But yeah, it's a hard problem, and I think for the best results you should use both.
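One way the runtime side could look is sketched below, assuming a reflection-based filter that redacts suspicious string fields right before a value is logged. The field-name heuristic and the sanitize helper are assumptions made for illustration; they are not the mechanism specified in the KEP.

```go
package example

import (
	"reflect"
	"strings"
)

// sanitize inspects a struct (or pointer to one) at the point of logging and
// blanks out string fields whose names look sensitive. A real implementation
// would be driven by the same source/sanitizer knowledge as the static tool.
func sanitize(v interface{}) interface{} {
	rv := reflect.ValueOf(v)
	if rv.Kind() == reflect.Ptr {
		rv = rv.Elem()
	}
	if rv.Kind() != reflect.Struct {
		return v
	}
	out := reflect.New(rv.Type()).Elem()
	out.Set(rv)
	for i := 0; i < rv.NumField(); i++ {
		name := strings.ToLower(rv.Type().Field(i).Name)
		suspicious := strings.Contains(name, "token") || strings.Contains(name, "password")
		if suspicious && out.Field(i).Kind() == reflect.String && out.Field(i).CanSet() {
			out.Field(i).SetString("[REDACTED]")
		}
	}
	return out.Interface()
}
```

The point is only to show the inversion: the check runs where the data is about to leave the program, instead of over the whole call graph ahead of time.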
E
Certainly, knowing what sorts of things to look for in static analysis depends fundamentally on what developers are doing, and without any signal about the false negatives your static analysis is letting through, there's not really a great way to iterate on it unless you have something on the dynamic, runtime side saying: this is what you're missing. So, yeah, that's my two cents.
A
I think we should probably talk more about this. I would also like Marek to be here, because Marek has just done the static analysis bit for instrumentation recently. So, yeah, I think it sounds reasonable to me, but we are out of time. I am so sorry, and, I don't know, my alarm's going off, so, hey...