From YouTube: SIG Instrumentation 20200625
Description
SIG Instrumentation Meeting June 25th 2020
A: Okay, today is the SIG Instrumentation meeting of June 25th. Frederick, you were saying?
B: I was just going to ask David if he could give us a quick follow-up from the discussion we had last time, where he was going to go to SIG Node and ask about the additional metadata in log paths. I believe, yep, over to you.
C: So there are two concerns that came up, which I think were already the reasons they didn't do this in the initial version of the CRI log format. One is that there are limitations on how much you can put in a file path, so there were concerns that we may not be able to add it directly to the path for the file. The second concern is scalability, just because we would have to watch namespace objects from all the nodes. So yeah, that was the feedback.
B: So I'm not sure we're actually the right group to discuss this, necessarily, but we have Han from API Machinery here. I wonder how this is not a problem with basically any resource, right? Anything that's namespaced: the namespace may be deleted and quickly recreated, or maybe there's just some external thing that doesn't necessarily watch, and it could be an entirely new namespace. It seems unreasonable to do this, but an obvious solution to this could be, like...
B: So the issue describes how the log metadata is lacking: it has the namespace name, but it doesn't have the namespace UID, and so there can be certain situations where recreating the namespace may actually cause a separate namespace to exist that happens to collide with the name, right?
A: Yeah, but that thing is kind of broken right now anyway, right? I mean, I've definitely seen orphaned objects under a namespace where, because garbage collection failed or whatever, you basically have to recreate the namespace in order to clean them up. So you're going to have a different UID, but it's effectively the same namespace.
E: Could the pods migrate between namespaces? What I mean here is: garbage collection didn't work and someone recreated the namespace with the same name. So I'm asking, are we worried about the case where pod logs using the name of the namespace would somehow not belong to the same namespace after it was deleted?
A: Yeah, okay, let's just clarify that on the issue. But we do have an agenda item, yeah, with, was it Alex and Patrick?
D: Yeah, we're here; this is Alex, and Patrick is also here, I see him online. I think last time we attempted to present sort of the background, we ran out of time a little bit, so I just kind of want to throw it to the folks on the meeting.
D: Should we go through the background, like explaining static analysis versus dynamic analysis and why we need both, or should we just answer the questions you may have related to the KEP? We are flexible, so let us know how you would like us to proceed.
A: Well, just for context: David and I met with Alex offline, I think a couple of weeks ago, and he clarified a couple of things for us which I don't think we had fully understood. So I thought it was kind of important that it be mentioned here, so that other people are also on the same page.
D: Okay, so a presentation, like a ten-minute introduction to this topic? Okay, perfect. Can you make me a presenter somehow?
D: All right, can you guys see the deck?
D: Well, that's good. All right, so just a little bit of background: my name is Alex Chernicholski, and with me here I also have Patrick Romberg. We are both on the GKE security team, and for the last year or so we've been working on developing tooling to help developers catch, early, the mistakes they make with respect to logging potentially sensitive data. In this context we basically thought it would be a good idea to share the tooling, and sort of the expertise, with the broader Kubernetes community, and fairly recently we open-sourced the static analysis tool, which is basically the same tooling we use internally in GKE. And how does this relate to the KEP?
D: So the KEP proposes to extend klog in order to analyze the objects sent to the klog logging library and find potential credentials. Let's use the term credentials, but it could be something else that we don't want to allow. And there were several good questions on the KEP, in the discussion, about...
D: Based on our experience, we found that, at the moment at least, the best approach is actually to use both, because that provides the best coverage. There are certain edge cases that are difficult to catch with static analysis, and, on the contrary, dynamic analysis, at runtime within klog, is better at detecting those. So that is kind of the TL;DR, and the rest...
D: I just want to give you a little bit of background about the static analysis. I don't know if we will do the demo, just to save time; I think the demo is not particularly exciting, because it just basically analyzes the source code, finds the line where we send something to a log, and prints the line where we found that occurrence.
D: So what precisely is the static analysis, and how exactly does it work? The type of static analysis that we are using is called taint propagation analysis. This is a form of analysis where the potential execution flows of the data are analyzed, and we detect instances where user input exits the boundary of a program without first being sanitized.
D: So, in other words, imagine a Kubernetes Secret or, for example, a TokenReview object, which contains sensitive information. Somewhere it is initialized, and it flows through the Kubernetes code base, and eventually it may hit something that we call a sink. A sink is something that takes the data outside the boundary of the process. It could be a log, it could be an HTTP call, it could be just writing to disk.
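As a toy illustration of those roles, the sketch below (with purely hypothetical names, not the actual tool's configuration) shows a source type, a propagator, a sanitizer, and a sink; the unsanitized flow is the one a taint-propagation analyzer would flag:

```go
package main

import (
	"fmt"
	"strings"
)

// Secret plays the role of a taint source: a type known to hold sensitive data.
type Secret struct {
	Name  string
	Token string
}

// describe is a propagator: the returned string is tainted because it is
// derived from a tainted value.
func describe(s Secret) string {
	return fmt.Sprintf("secret %s token=%s", s.Name, s.Token)
}

// sanitize makes a tainted value safe for logging by redacting the token.
func sanitize(msg string) string {
	if i := strings.Index(msg, "token="); i >= 0 {
		return msg[:i] + "token=REDACTED"
	}
	return msg
}

func main() {
	s := Secret{Name: "db-creds", Token: "hunter2"}

	// BAD: tainted data reaches the sink (the log) unsanitized.
	// A taint-propagation analyzer would flag this line.
	fmt.Println(describe(s))

	// OK: the data passes through a sanitizer before reaching the sink.
	fmt.Println(sanitize(describe(s)))
}
```

The analyzer's job is purely structural: it reports any path from `describe`'s result to a sink that does not pass through `sanitize`.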
D: Interestingly enough, this form of static analysis started out as a way to fight cross-site scripting attacks, where malicious users input, say, malicious JavaScript into a web form, that then gets saved to the database, and later on it can do some bad things when users load those pages back. So this is kind of the same principle.
D: We basically assume that something is tainted, and we look for instances where it is sent to the sink. Then there is the concept of a sanitizer, which is important as well. Ideally, you want to have a library within the Kubernetes code base...
D: Ideally, we want to have a library that knows how to sanitize things, so that, for example, when it receives an object, it knows how to make it safe for logging. I think, essentially, this is what this KEP proposes: that klog will become this sort of library. It sort of combines two functionalities: it logs, but it also sanitizes. Obviously, as part of the implementation we will actually separate those two functionalities, but from a simplicity point of view you can think of klog becoming both sanitizer and logger. And of course, from the static analysis point of view, if we detect that the data is sanitized before it exits the boundary of the process, then that's fine; we don't have to raise any concerns there.
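A minimal sketch of that "logger that also sanitizes" idea (the helper names are invented for illustration; the KEP's actual implementation separates the two concerns):

```go
package main

import (
	"fmt"
	"reflect"
)

// Secret stands in for a type the sanitizer treats as a credential carrier.
// In the real proposal this knowledge would live inside the logging library.
type Secret struct{ Token string }

// sanitizeValue returns a log-safe representation of v: values of sensitive
// types are replaced by a redaction marker, everything else passes through.
func sanitizeValue(v interface{}) interface{} {
	if reflect.TypeOf(v) == reflect.TypeOf(Secret{}) {
		return "[REDACTED Secret]"
	}
	return v
}

// safeLog sanitizes every argument before formatting it, so callers cannot
// accidentally log a Secret verbatim.
func safeLog(args ...interface{}) string {
	out := make([]interface{}, len(args))
	for i, a := range args {
		out[i] = sanitizeValue(a)
	}
	return fmt.Sprintln(out...)
}

func main() {
	fmt.Print(safeLog("loaded", Secret{Token: "hunter2"}))
}
```

From the static analyzer's perspective, a call to `safeLog` counts as a sanitizer, so data flowing through it raises no finding.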
D: Then there is also this issue of propagators, which is where things become quite tricky. Think of, for example, gRPC, or protobuf encoding, or deep copying: you take something and you convert it into a different format. Take, for example, a Secret object and convert it to a string; now the string is also tainted, right?
D: I think there's some noise coming from one of the attendees; if you could just mute yourself, that would be awesome.
D: Of course, this is operating at runtime: dynamic analyzers leverage reflection, and potentially also use call-stack analysis, to identify potential sources.
D: So why do we need both? Identifying tainted data is hard. It's a hard problem; it's still the subject of active research.
D: So, therefore, for best results we recommend a defense-in-depth approach: use both static and dynamic analysis. Essentially, what we are saying here is that this is the first KEP that we are proposing in order to mitigate the risk of logging credentials, and we will be following up with another KEP where we will propose the use of a static analysis tool as part of the pre-submit pipeline within Kubernetes, just to keep the KEP actionable.
A: Yeah, so for me that was the key thing that was missing from it, because the static analysis question is kind of, I guess, the obvious one that people have. And then, basically, when I understood that you had intended to do the static analysis thing, and that you in fact already had code for it, I thought that was particularly relevant, especially considering that Marek just implemented the static analysis piece for instrumentation and so could probably give a couple of pointers.
D: Yeah, we're already talking to Marek, and he was very gracious and provided a very detailed list of things we need to think about. This is what Patrick will be working on in the coming month. But hopefully this gives you some idea that it's not one versus the other: it's really both that we need to implement for the best results, and this is just the first portion of this effort.
D: Right, right. So the dynamic analysis, by definition, runs at execution time, and it runs constantly: essentially, every time there is a call to klog, some logic will run in order to determine whether or not it is safe to log this piece of data.
D: The new idea there, though, is to rely on caching. The general idea is that 99% of the calls will not have credentials attached to them. So basically, once we analyze a call site and determine that it is not producing any credentials, then we can probably cache that fact for some time, and that will be the main idea behind the optimization, yeah.
D: But this is something we will probably best discuss as part of the implementation, on the CL, and we'll provide some benchmarks, et cetera. But it's certainly at the top of our mind with respect to how to minimize the performance overhead.
F: Yeah, just to make clear: there will be a switch that users will be able to use to turn this functionality on, and for now we propose that this functionality be switched off by default. So the overhead should be really, really minimal when the functionality is switched off.
D: Yeah, you're exactly right, Pavel. As a matter of fact, the way I look at dynamic analysis, eventually, once it matures, is that it's quite possible that users will actually run this dynamic analysis in their testing and staging environments, and basically issue alerts when something is found and fix the problems before they go into production; then running it in production may or may not even be something that users need. But I don't think we're quite there yet; not everybody is at that maturity level, you know, to set up alerts and react to them in testing and staging.
D: I guess my question is: what do you think would be an acceptable overhead? I think we will just keep it in mind as we implement this. I mean, I guess zero, probably, but would, you know, a 20-30 millisecond overhead be acceptable? What do we think would be something that the community would accept?
A: You should talk to Wojciech about that. Okay, yes, sure. But on the instrumentation side, the thing that is particularly relevant is the API surface which you are expanding in klog, particularly the filter methods, which I think are actually just generically probably very useful for us, because that basically gives us an injection mechanism. It's almost like layered handlers or something.
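The sort of filter hook being discussed could look roughly like this (a sketch modeled on the proposal's structured-logging entry point; the interface and function names here are illustrative, not klog's actual API):

```go
package main

import "fmt"

// LogFilter is the injection point: the logger hands every message and its
// key/value arguments to the filter before formatting them.
type LogFilter interface {
	FilterS(msg string, keysAndValues []interface{}) (string, []interface{})
}

// redactingFilter replaces values whose key is "token" with a marker.
type redactingFilter struct{}

func (redactingFilter) FilterS(msg string, kv []interface{}) (string, []interface{}) {
	out := make([]interface{}, len(kv))
	copy(out, kv)
	for i := 0; i+1 < len(out); i += 2 {
		if k, ok := out[i].(string); ok && k == "token" {
			out[i+1] = "REDACTED"
		}
	}
	return msg, out
}

var filter LogFilter

// InfoS mimics a structured logging call that consults the filter, if set.
func InfoS(msg string, keysAndValues ...interface{}) string {
	if filter != nil {
		msg, keysAndValues = filter.FilterS(msg, keysAndValues)
	}
	return fmt.Sprintf("%s %v", msg, keysAndValues)
}

func main() {
	filter = redactingFilter{}
	fmt.Println(InfoS("login", "user", "alice", "token", "hunter2"))
}
```

Because the hook is a plain interface, the same mechanism could layer in other handlers besides sanitization, which is the "injection mechanism" point made above.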
D: Exactly, yeah, I agree. I think it will be a useful feature for potentially other things as well. But yeah, I think performance will become very important; we are fully aware of that, and we are thinking very hard about how to minimize it the best we can. We will produce some numbers.
D: I know Pavel already did some experiments, instrumenting klog and looking at what potential overhead it might produce, but I think it will probably be best to discuss that over the CLs and benchmarks, like, what's happening there.
A: I personally have no objection to expanding the klog interface to add this filter mechanism, because adding layered handlers for logging could be generically useful. I don't know how everyone else feels about that; I think that's probably the most significant change from an instrumentation perspective. But does anyone... well.
B: I don't think I'm against this, but that's also probably because I just haven't spent enough time with klog myself. So I would very much like us to get a review on this from Sally and Tim Hockin, because they've been quite in the weeds of klog v2, as well as Marek, obviously. So I would like to delegate that, essentially, to them, and if they're comfortable with this, I'm perfectly happy with moving forward with it.
E: Yeah, definitely, I think their feedback would be useful. I think there are some ideas in Tim's head to, like, deprecate, or stop relying on, klog, because it's starting to be a wrapper for our customized implementations of logging. The current structured logging KEP introduces a pluggable system for logging, for JSON, and this is the direction; this is also the interface that they designed. So I'll be very interested in their feedback on how this mechanism would be integrated with it.
A: Wouldn't it make more sense to implement the static analysis piece first? Because, since we're actually actively working on replacing klog with the structured logging thing, it would almost make more sense to do the static analysis piece and then to integrate directly with structured logging as a second step, if that's available, right? Because that could be available.
E: Yeah, so currently there are two layers: klog is currently its own implementation that also wraps the lower-level interface that we want to go deeper into and start developing, and at some point we may reverse those, so klog would not be the default implementation; it would be one of the implementations. It should be pretty easy to integrate one of those already.
D: I know we have one minute; I just want to throw out one thought on why this sequence. You can certainly do it the other way around, start with the static analysis and do the dynamic analysis later, but what we found from our experience is that the dynamic analysis was finding things. Basically, the challenge with static analysis is knowing the types that contain credentials.
D: What we found is that, by using dynamic analysis, we were learning about the system, about the types that actually carry credentials, and that would feed back into it. Basically, it's a kind of feedback loop: we find something with dynamic analysis, and we extend the configuration of the static analysis so it knows, hey, this type here can also contain PII.
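That feedback loop amounts to little more than maintaining the static analyzer's list of credential-carrying source types and appending to it whenever the runtime check flags a new one (an illustrative sketch; the real tool's configuration format is its own, and the `example.com` type is hypothetical):

```go
package main

import "fmt"

// sourceConfig is the static analyzer's list of types treated as taint
// sources. Secret and TokenReview are the examples mentioned above.
var sourceConfig = []string{
	"k8s.io/api/core/v1.Secret",
	"k8s.io/api/authentication/v1.TokenReview",
}

// reportFromDynamicAnalysis records a type that the runtime check caught
// carrying credentials, so future static runs flag it too.
func reportFromDynamicAnalysis(typeName string) {
	for _, t := range sourceConfig {
		if t == typeName {
			return // already known
		}
	}
	sourceConfig = append(sourceConfig, typeName)
}

func main() {
	// A hypothetical finding from a staging run feeds the static config.
	reportFromDynamicAnalysis("example.com/pkg.DatabaseCreds")
	fmt.Println(len(sourceConfig))
}
```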
A: Well, from a safety perspective, static analysis has no runtime performance cost, so you can introduce a static analysis check that does very little and no one's going to object to it, because it runs in pre-commit, correct? There are no blockers; it's super easy to do.
D: Yeah, Patrick is already running it on his workstation. We actually already found one bug and fixed it, so we're already doing this. You're absolutely right: it's something that we can just do; we don't even need to ask. It would be nice to run it on every submit, but we are running it on a regular basis, and when we find something, we report it, yeah.
D: But do you think it has relevance for this KEP? Do you think it changes anything in what this KEP proposes?
A: Yes, because mostly, I think, everyone was wondering about the static analysis while reading this KEP. So if the static analysis question were answered, then basically you'd be like: oh yeah, we already have the static analysis, we're supplementing it with this, and then you'd explain the interactions between the two. As opposed to: we're doing this dynamic analysis thing first, but there's a performance cost. So, I mean, yeah, that's how I think of things, and possibly other people may go through that same cycle of thought.
C: I think we're out of time. Thanks everyone, and especially the presenters today, and we'll see everyone next week.
A: All right, yeah, thank you. The proposal looks good to me, and, you know, I'm okay with it if, yeah, the klog things work out, or whatever.
D: Well, thank you. Thank you so much for your time. It was very nice talking to you guys; hopefully we can move forward on this. Really appreciate your feedback, thanks.