From YouTube: Protect:Container Security group discussion 2021-07-14
Description
https://docs.google.com/document/d/1qCwZfoo1A-FihE2ifzd4ZT_Mpz-xFzZvAPJ7pJvWCEY (internal document)
A
Welcome to our weekly container security group meeting. So today... oh, we actually have a couple of other agenda items. We've got Alan, who has a demo for cluster image scanning, which is great; really excited. We've got the first version of that out in alpha, so if you have time, check it out; we've got our documentation on the website too.
A
So, mostly for today, I wanted to just have a brainstorming, open-ended discussion, and see if we can come up with any good ideas for a solution on how to improve the container scanning matching. It's a really hard problem to solve, and fundamentally the problem is that right now, the way it works in GitLab, the findings aren't properly deduplicated.
B
I know the feeling. That's me, that's me at 6:30 in the morning, first thing in the morning.
B
That was the main downside. So then we discussed locking that, making it so that, okay, once you've set this, you can't undo it after that. I don't think I've written this anywhere, but there's been a magic set of console commands to kind of declare bankruptcy and start again; I think that could come in handy here. So if you find yourself in a situation where you say, you know what, I need to change and reset things, then maybe that'll be an interesting thing to consider.
A
On that point, actually, I agree; I think that's a great idea. You know, if you were to change that pattern match, that regular expression, for any reason, you might also want to just wipe out all of the previous container scanning findings and start clean at that point. Yeah.
B
What else is to be considered? So, I think, Sashi, you can keep me honest here, but I think every solution that we thought about has a dependency on tracking vulnerabilities in the non-default branch.
B
Yeah, and that's something that's not supported at the moment; it'll be something for Threat Insights to look at and put in. So, as you said...
B
A customer that doesn't deploy from master; or maybe they deploy from master to a production environment and they deploy from stage into a staging environment, and they've got a workflow that tells them, hey, you've got to do staging, then master. So then, what happens in these situations is that you not only want to track vulnerabilities in master, you want to track them in staging as well; and when you're first merging, you don't want to compare. So say you create a feature branch to implement something.
A
So with this approach, I'm not trying to solve for all of those different workflow scenarios, because I think that's a bigger problem than what we've got here.
A
You know, the Threat Insights group is more suited to solve that anyway, for everyone, whenever it does come up on the roadmap. But really, I'm looking at customers where, even supposing you're still just following the default flow of merging into your default branch and releasing off of your default branch: if your naming strategy for your images differs from what we expect, like if you have the branch name in the tag, then it throws it off.
A
Right, so I think, you know, we don't need to solve everything with this iteration. But even just the example in that proposal description says: you know, if you've got a master branch and a feature branch here, and the branch name is in the tag, it's not going to properly correlate.
B
I think this helps. In fact, you might have mentioned that and I forgot; I was still thinking about it before you mentioned it. So that does reduce the complexity a little bit.
A
Yeah, so I know I've talked about different branching strategies, and that's even how I kicked off this discussion, mentioning different branching strategies. But yeah, we definitely should descope this to just focus on the use case of different image names. We don't want to solve that broader problem yet; that's outside the scope of this epic. Let's just focus on when the user is merging from a branch into the master branch: how can we make sure that things are always properly deduplicated, regardless of their image name?
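A minimal sketch of the mismatch being discussed, assuming a simplified fingerprint scheme (the real GitLab location fingerprint is more involved; the registry URL and tag names here are made up):

```python
import hashlib

def fingerprint(image, package, version):
    # Simplified stand-in for a location fingerprint that, like the current
    # behavior described above, folds the full image name (including a
    # branch-named tag) into the hash.
    return hashlib.sha1(f"{image}|{package}|{version}".encode()).hexdigest()

# Same CVE in the same package, but the tag carries the branch name:
on_master = fingerprint("registry.example.com/app:master", "openssl", "1.1.1k")
on_branch = fingerprint("registry.example.com/app:my-feature", "openssl", "1.1.1k")
print(on_master == on_branch)  # False: same finding, two fingerprints, so a duplicate

def image_agnostic_fingerprint(package, version):
    # The direction under discussion: leave the branch-dependent image name
    # out, so the same finding hashes identically on every branch.
    return hashlib.sha1(f"{package}|{version}".encode()).hexdigest()

print(image_agnostic_fingerprint("openssl", "1.1.1k") ==
      image_agnostic_fingerprint("openssl", "1.1.1k"))  # True
```

Whatever stable input replaces the image name (for instance the base image, as proposed below) has to be identical on the feature branch and the default branch for deduplication to match.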
B
So then, Sashi, going off of the... sorry, I'm on the epic; I want to be on the latest spike.
B
Because I think that issue here is showing the epic; let me link to the spike.
E
This was just what we used to document how it works currently. So in the proposal, in another comment down there: I was still working on the PoC following that approach, using the CS base image, you know, the base image against which all the comparisons are made. It is still at the PoC stage, I would say, since in a lot of cases it doesn't work, and there are performance issues as well in terms of reading the value of the CS base image from the CI YAML file.
E
That
is
quite
a
bit,
not
performant
one.
So
but
overall
it
works.
Fine.
The
works
that
I
mentioned.
So
what
I
did
was
the
cs
based
image.
It
extracts
that
value
only
for
the
pipeline
that
runs
against
a
non-default
branch.
So
if
it
runs
on
a
default
branch,
it
would
not
come
back
against
the
sales
based
branch.
E
So that way, the fingerprint generated would be the same as what is in the default branch, and the deduplication would pick it up. So that worked fine, but there are a few questions on that. If the customer needs to update the default branch or the default image, then that use case might not work properly as they expect, because then the fingerprint will change again, and the newly generated vulnerabilities would be duplicated again, because the fingerprint will change.
E
Yeah, that would be fine to do: if we change the base image variable, then we'll just reset all the old data, and then it gets started fresh.
A
Yeah, I mean, I don't expect it to be a common use case for customers to be changing that variable, right; like, you set it up for the project, and it really should just be stable. It's going to be a really unusual scenario for them to change that. And I think if we have requirements like, you know, you have to delete all your vulnerabilities, then I think we can put some documentation and some caveats around that sort of scenario, because it really is an edge case scenario.
E
So, also, Thiago and I had a discussion about the same thing. We could either do it using a CI variable, or we could also use the configuration screens for container scanning, as we discussed. But it turns out that if we have multiple container scanning jobs running in the same pipeline, the configuration screen will not work, because with the configuration screen we assume that there will be only one base image. But let's say a project pushes two base images; I think one example is mentioned in the comments as well.
B
I think we do that, don't we? We do that in the container scanning analyzer itself: we publish a UBI image; actually, four images: we publish Trivy and Grype, and Trivy and Grype for UBI.
B
One thing, I think, Sashi, and I don't know if we talked about this actually in this context, but we were talking about it: it's almost like you'd start parsing the Dockerfile yourself to understand what the layer build is, right. Because then, if you're building four images, or two images, in your container build, and they both use the same base image; so say I use a Ruby base image, and then I use one...
B
One container builds one app and another container builds a different app, but they both use the same base image. Any vulnerability on that base image will be on both generated images, which means they're the same; once you fix it, you fix it in both, and Docker understands that. If you look at... if you do a docker inspect, you see the layers in there, and the... not the manifest. What is it called?
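The shared-layer point can be sketched like this. The layer digests are made up, but the set logic mirrors how two images built FROM the same base share the base's layers:

```python
# Hypothetical layer digests for two app images built from one shared base.
base_layers = ["sha256:aaa111", "sha256:bbb222"]
app1_layers = base_layers + ["sha256:app1-code"]
app2_layers = base_layers + ["sha256:app2-code"]

# A CVE located in a base layer shows up in every image that includes it...
vulnerable_layer = "sha256:bbb222"
affected = [name for name, layers in (("app1", app1_layers), ("app2", app2_layers))
            if vulnerable_layer in layers]
print(affected)  # ['app1', 'app2']

# ...so one underlying finding (and one fix: rebuild the base) covers both,
# which is the argument for consolidating rather than double-reporting.
shared = set(app1_layers) & set(app2_layers)
print(sorted(shared))  # ['sha256:aaa111', 'sha256:bbb222']
```

In a real implementation the digests would come from the image config (for example, the RootFS layer list that `docker inspect` shows), not from parsing the Dockerfile.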
A
Right now, container scanning is a very noisy scanning job, and so anything we can do to reduce that noise, consolidate vulnerabilities, and reduce the overall work helps. If there's only one fix for the security team, there really should only be one vulnerability in that case, and we should just notate within that vulnerability all the different areas that are affected by that same vulnerability. That way we reduce the clutter and the sheer volume of things that they have to go through in triage.
B
Something around that topic that I proposed in the Threat Insights group... and it wasn't even the original idea, because I think Philippe had the same sort of idea with a different take. But basically: facilitate the triage by creating a reusable list of triage decisions, so that you can say, hey, I've already looked at these CVEs in the context of an application that runs in a certain way; these are all...
B
Philippe's idea is to have an include for an allow list, an ignore list, where you can just say: hey, never worry about these. But those are cumbersome to create, right; you've got to bash in CVEs and all that, although you could script it, like, you export things. Anyway, what I was thinking was something that would be sort of built into the workflow.
B
So as you're triaging things and you dismiss something, you say, oh, and by the way, add this to this exception list here. And then each project could apply that list and say, you know, any time you scan things, apply this list here, and then at the end you only get the stuff that you want. And I'm mentioning this because you made a helpful comment to me. Anyway, the objective is to reduce the number of findings, to reduce the noise.
A
Yeah, there are some other ways that we can get some good smarts around that too. All of that is further out on our roadmap than, you know, really the foreseeable future, but if there is some sort of overlapping solution that also addresses, you know, the main point of this epic, I'm definitely open to taking those on together, if that helps make this easier in some way.
B
Yeah, I think the best we've got so far, after many days of research, is this base image idea. And did you think about how to solve the performance problem, Sashi? No pressure.
E
So I have two approaches in my mind. One approach, as I mentioned already, is introducing this new index table to reduce the DB performance overload, because currently the location fingerprint and the location data are inserted as JSON, and filtering using a field inside the JSON would not be so performant.
E
Instead
of
reading
it
from
the
ciaml
file.
I
was
thinking
if
we
could
modify
the
schema
of
the
report
and
from
the
container
scanning
and
gcs
project.
We
could
inject
that
into
the
system
unfold
itself
so
that
in
the
rails,
when
we
read
back
the
report
json,
we
don't
query
the
gitlab
crm
file.
Instead,
we
just
go
to
the
report
file
and
get
the
base
image.
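A sketch of that injection step. The `base_image` field under `scan` is an assumed extension for illustration, not the actual security report schema:

```python
import json

def annotate_report(report, base_image):
    # Analyzer-side step: record the resolved base image in the report
    # itself, so the Rails side never has to re-parse .gitlab-ci.yml.
    # "base_image" under "scan" is an assumed field, not the real schema.
    report.setdefault("scan", {})["base_image"] = base_image
    return report

def base_image_from_report(raw):
    # Rails-side read: one JSON lookup instead of a CI YAML parse.
    return json.loads(raw).get("scan", {}).get("base_image")

report = annotate_report({"vulnerabilities": []}, "ruby:3.0-slim")
print(base_image_from_report(json.dumps(report)))  # ruby:3.0-slim
```

The payoff is that the value travels inside the artifact the backend already reads, so no extra file access or YAML parsing happens per occurrence.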
B
Which is undergoing a 13-point refactor, starting like this week or next week.
E
Because
so
I
was
thinking
if
we,
if
that's
going
to
happen,
maybe
we
could
take
it
back,
connect
yeah,
so
use
the
same
for
this
as
well.
Maybe.
B
No, nothing obvious to me; it's just a matter of breaking it down, refining, and estimating. Okay. And usually, sometimes things pop up when you're doing that.
B
But
I
can't
I
can't
see
in
terms
of
timing,
though,
by
the
time
by
the
time
threading
sites
is
done,
doing
the
store
report
service
refactor.
B
It
will
make
it
easier
to
implement
this,
and
we
probably
wouldn't
want
to
do
it
before
anyway,
not
not
because
not
only
because
it'll
be
harder,
but
because
it's
you
know,
it'll
be
through
almost
like
throw
away
work.
Somebody's
gonna
have
to
double
up
on
that,
and
the
second
part
is
what
sashi
was
saying:
the
the
the
index
table.
So
the
idea
behind
the
index
table
is
there's
a
very
large
table.
B
That's
hard
to
search
because
there's
some
tricky
joins
the
the
index
table
or
the
lookup
lookup
table
is
relies
on
some
triggers
whenever
things
are
instead
or
updated,
it
updates
that
lookup
table.
So
then,
when
you're
generating
reports
and
doing
comparisons,
you're
reading
off
that
easy
to
query
table
and
then
to
display
things,
you
go
back
to
the
to
the
source
of
truth,
which
are
the
the
occurrences
tables.
B
All
has
it's
about
it's
about
10
points
as
well
and
depending
on
what
your
thoughts
are
same
on
the
roadmap.
You
could,
you
could
offer
some
help.
You
know
for
people
in
this
team
to
pick
that
up
instead
of
waiting
for
them,
but
either
way
somebody
we
need
to
do
that.
B
Yes, so there's a huge table called vulnerability_occurrences, and in order to show reports you need to do some joins with other tables that are not that small either. So then the general idea is: instead of doing those joins and running that query, which is currently timing out, actually, for large projects; if you go to the gitlab-org group and go to the vulnerability report, you get a timeout.
B
If
you've
been
doing
an
amazing
job
with
minutes
by
the
way
sam,
let
me
let
me
let
me
just
put
some
links
here.
A
Makes sense to have those triggers in place. What I'm not understanding fully, I guess, is: what is our dependency on that work for what we're trying to do? Are we trying to read from that?
B
So the first challenge is that what Sashi described, in terms of deduplicating... all the deduplication work happens in this store report service. That's the thing that reads the JSON file, parses the security report, then goes and compares and deduplicates things. So that service is a beast, and it's hard to work with. So we want to wait for the refactor.
B
Which is tackling the performance issue that Sashi is talking about, so we would have the same issue. Our issue is a little bit different: it's not so much that the table is big, though that doesn't help; it's that, to do the lookup that's been proposed in this solution, that's a substring search in a column. You can't index that, because it's basically JSON in a column, so you'd need to read every row, parse the JSON, and then search for what you're looking for in the JSON. And an easier way to do it...
A
Yeah, so I think it's reasonable for us to take that piece on. You know, we have a strong interest in having that trigger created, so I don't mind helping out and creating that trigger; you know, the trigger that we need to solve that performance problem.
B
So I've pasted two of those issues, for the vulnerability reads. I'll have a chat with the Threat Insights engineers to see if it would make sense to do that before doing the refactor, because then we could parallelize that work, right; instead of doing one after the other, we'd do them at the same time.
B
It hasn't finished refinement and it's kind of hard to look at, but I'm looking at six issues; at an average of two points each, that's probably 12 points there. So I'd call that a whole engineer dedicated to that for a whole milestone.
B
I mean, anything helps, right; even then, it just means we'll be waiting less. But we probably wouldn't be able to make this feature live without that performance work. You could throw in a feature flag to say, hey, this won't be turned on until that performance issue is resolved, and then you can just wait; but at some point you need to do it.
B
If I were in your place, I'd probably wait, and if it happens that, once it's all refined and ready to go, you find that it's best to help, then jump in, and I can coordinate between the two groups. But without helping Threat Insights, I don't see us doing this much sooner than 13.6... sorry, 14.6.
A
Okay, so yeah, we wouldn't even start until then. So I'm glad we have a plan here. I think we'll probably want to just document this really well, then put it on the shelf, and then likely it makes sense to pick this up, you know, after that refactor is done, in 14.3 or 14.4. Yep, we'll do all the work that we can, and then we'll see where the work is at at that point for the trigger work; you know, maybe we'll pitch in and help out some there.
E
You could still do it, but I'm still not sure how the refactor would end up, because we have this parser, which is actually more logical; that's at least what we want. If the refactor doesn't touch the parser, I think it would be fine to start, yeah. I'd still maybe wait for the refactor, just to see what you get.
A
We could. I would say that's partially up to you on the engineering team; I don't have a super strong opinion there. But you know, it's not like we're going to be delivering any value to customers by picking it up, then stopping, and picking it up again. So I don't know; I guess I have a weak preference for just waiting: putting the whole thing on pause and picking it up later. Yeah.
A
Yep, exactly. Okay, so, yes, Sashi: if you're clear on the technical solution, why don't we document it really...
A
Well, you know, let's just document it really well; let's refine it as much as we reasonably can at this point, with the assumption that, you know, we're starting this after the store report service is done being refactored. I mean, it might be a little bit hard to fully refine, since that's not done, so we don't know what we're dealing with. But let's just document it as well as we can, put it on the shelf, and pick it up.
B
Yeah, I was going to suggest: if you could move it to the epic, then, as the next step after that, we could look at the planning breakdown for it and write implementation issues. Basically, while this thing is in the hot cache, you know, L1, L2, whatever one is the fast one: write it all down there, so we don't have to remember all this three months from now.