Description
Related issues:
Add a new `id` property to replace the legacy `cve` in JSON common security report format: https://gitlab.com/gitlab-org/gitlab/issues/36777
Change vulnerability feedback identification: https://gitlab.com/gitlab-org/gitlab/-/issues/205489
A
Hi everyone, my name's Lucas Charles, we're with Secure, and today we're going to be talking about stable vulnerability identifiers in our reports and potentially beyond. So this conversation came out of a recent issue. I would actually switch the order of the two related issues in this agenda and say adding a new `id` property came first, but after some discussions with Daniel we're trying to figure out whether we're actually solving the same problem with these two things. So I guess let's run down into these questions.
B
Now, the current `cve` property is used for different purposes. On top of being legacy and filled with mixed values, it's also used for two different things. The first is matching the remediation with the finding, which is what we are trying to replace with this new `id` property, and the second is that it's used to generate the `project_fingerprint` value, which is leveraged by all the vulnerability feedbacks to do the matching between a finding and its feedback. So we have two separate issues.
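For context, a minimal Go sketch of that second use, assuming (this is not the actual GitLab code) that the `project_fingerprint` is simply a hash of the `cve` string, which is why any instability in `cve` breaks the finding/feedback match:

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// projectFingerprint mimics deriving the feedback-matching fingerprint from
// the legacy `cve` string (illustrative only, not the GitLab implementation).
func projectFingerprint(cve string) string {
	sum := sha1.Sum([]byte(cve))
	return hex.EncodeToString(sum[:])
}

func main() {
	// Two runs producing slightly different `cve` strings for the same finding
	// yield different fingerprints, so existing feedback no longer matches.
	fmt.Println(projectFingerprint("app/models/user.rb:42:rule-1"))
	fmt.Println(projectFingerprint("app/models/user.rb:43:rule-1"))
}
```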
B
The first one is to add a new `id` to replace the legacy `cve` when it comes to uniquely identifying a finding, for the primary purpose of using it in remediations. Then we have a second issue, which is migrating the feedback identification so that it no longer relies on the `cve`. With these two steps achieved, we can remove the `cve` property, because we will no longer use it for any purpose. And talking about CVE: it's been a long time that the real CVE is part of the identifiers array.
B
The problem is we're needing some stability in this property, so that from different pipeline executions we are generating the same `cve` value, because it goes into the `project_fingerprint`, which is used to map a finding with a feedback, and we want to keep the feedback stable. For example, you run a first pipeline and get vulnerabilities. You get findings with a given `cve` value. You create a feedback by creating an issue or merge request, or by dismissing that finding. Then you run another pipeline.
B
You still have the same finding, but if the `cve` value is different, then the `project_fingerprint` will be different and the existing feedback won't be matched with this finding. So you will lose the feedback you created in the previous pipeline. This is why we were trying to generate that value based on properties that are as precise as possible, like taking the primary identifier, or taking a hash of some stable properties like the line number, which is not as stable as we might think.
B
But
this
is
why
it's
super
crappy.
It's
because
it's
really
depend
on
the
analyzer,
and
it's
not
well
done
for
quite
some
time,
and
this
is
why,
when
it
comes
to
identifying
a
vulnerability
on
the
race
back
end,
we
no
longer
use
that
property.
We
moved
away
from
that
and
use.
Instead,
the
report
type
primary
identifier
and
the
location
fingerprint,
which
is
degenerated
from
various
location,
sub
properties
depending
on
the
repo
type.
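As an illustration of the identification just described, a minimal Go sketch (field and function names are hypothetical, not the actual Rails implementation) of a finding identity built from the report type, the primary identifier fingerprint and the location fingerprint:

```go
package main

import "fmt"

// Finding holds the three properties the backend uses for identification
// (hypothetical field names).
type Finding struct {
	ReportType                   string
	PrimaryIdentifierFingerprint string
	LocationFingerprint          string
}

// identityKey is the triple used to recognize "the same" finding across
// pipeline executions, without touching the legacy `cve` value.
func identityKey(f Finding) string {
	return fmt.Sprintf("%s|%s|%s",
		f.ReportType, f.PrimaryIdentifierFingerprint, f.LocationFingerprint)
}

func main() {
	f := Finding{ReportType: "sast", PrimaryIdentifierFingerprint: "ident-sha", LocationFingerprint: "loc-sha"}
	fmt.Println(identityKey(f))
}
```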
A
Jumping ahead: can we generate the `cve` using that exact same approach, and then we'd be okay? I mean, if we didn't care about deprecation and we just said we need to make the `cve` better and change no code, could we just create a hash of the report type, primary identifier and location, and then we're good?
B
The problem comes with third-party integrators, because if we want to do that, we need to expose to third-party integrators how to generate that ID, and depending on their system and the way they are generating their vulnerabilities, they might want to use their own way of generating the ID, and they might want to use an ID that they can track back later. To me, this is a different purpose, and this needs to be a separate, specific field that we can put in the relevant part of the report.
C
Okay, so to clarify, first off: thank you so much for digging into this. I know we're kind of bringing this back up, but this has been really useful for me to understand the codebase and how it all gets tied together with the Rails app and whatnot. So I see that this field, `cve` — I'm going to talk about the old field — has had multiple jobs, and it seems that maybe what we're trying to do is make `id` only do one of those previous jobs, and handle the other job a different way.
C
Is
that
correct,
okay,
and
so
this
new
ID
field
really
only
needs
to
be
unique
within
the
context
of
a
single
report,
and
it
won't
be
used
for
grouping
occurrences
into
a
vulnerability
in
the
future.
Okay
that
all
makes
sense.
So
just
not
as
a
suggestion,
but
just
pushing
on
my
understanding
of
the
problem.
Technically,
could
we
just
enumerate
zero
through
one
through,
however
many
and
just
put
an
integer
as
a
ID?
B
Why not? If I remember correctly — we had this conversation with Fabian — I think it's because this has to be unique within the context of not a scan but a given scanner, and this is why we initially wanted to couple that with the scanner ID. And the reason is that, because we will aggregate that with other scans for the same report type, we might end up with different findings having the same ID, which is not something we want to deal with.
B
When we process remediations, it can be after we aggregated the data. I don't think the remediation algorithm is tied to a given scan. Maybe this is something we can revisit in the future, but for now we generate multiple JSON reports on the analyzer side, and each analyzer might not know about the others. So we need to make sure that they are independently capable of generating a unique ID, because they don't know about the others.
B
They don't know how we will aggregate things on the Rails backend, and we don't know how we will leverage this remediation ID later. Maybe we will move the remediation stuff into a separate job. So, to avoid blocking any future change in the way we architect the remediation workflow, having a uniquely generated ID is the best approach to me, and this is why I was initially pushing for using a UUID. But we have another issue, which is that we already have a UUID column in the vulnerability_occurrences table, and it's there for a different purpose.
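A sketch of the UUID option being argued for here, using the github.com/google/uuid package; the struct and field names are illustrative only:

```go
package main

import (
	"fmt"

	"github.com/google/uuid"
)

// reportFinding is an illustrative stand-in for an entry in the JSON report.
type reportFinding struct {
	ID   string // proposed `id` property
	Name string
}

// newFinding assigns a fresh UUID, unique even across concurrent scan jobs
// that know nothing about each other.
func newFinding(name string) reportFinding {
	return reportFinding{ID: uuid.New().String(), Name: name}
}

func main() {
	f := newFinding("Possible SQL injection")
	fmt.Println(f.ID) // unique across all runs, but not reproducible
}
```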
C
So, again to clarify here: previously we said we need it to be unique within the context of a report, but actually the context is within an analyzer. It needs to be unique to an analyzer, within that context, right? Is that correct? Because you want to be able to say: here's an ID, and it points at a single occurrence within all the runs for this analyzer. Yeah.
B
For example, we are discussing scanning multiple Docker images with container scanning, so we can have two container scanning jobs running at the same time on two different images. This is the kind of thing we consider when thinking about the uniqueness of this ID, and this is why, to me, you know, a UUID is the best possible way, because we are making sure it's always unique.
A
So, right, during this conversation I keep thinking how confusing this would be in the future if we don't just use the exact same uniqueness identifiers that we use in Rails. Namely, going back to generating this identifier — we probably don't need to, but generating it from the report type, primary identifier and location at least means we're using the same identifier for vulnerabilities everywhere, or rather for findings everywhere, and it seems like that's going to reduce a lot of confusion down the road. So I just want to double check that that's not the direction we want to go, because it looks like it'd be a fairly simple code change from where the existing MR is right now.
B
The only drawback about this is that we need to expose, in the analyzers and to third-party integrators, how to generate the location fingerprint. The report type is obvious. The primary identifier is relatively obvious: it's the first identifier that you put there, and we have explained in the documentation that this has to be something really unique and stable for that specific finding. The last part is the most complicated one: which properties from the location object are we taking to generate this location fingerprint? This is a compromise between taking properties that are as precise as possible — to identify this finding in the most unique way and to distinguish two different vulnerabilities in the most fine-grained way — and keeping those properties stable. The best example: you have two vulnerabilities on the same line of code, and the only difference is which column in that file is impacted.
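A minimal sketch of that compromise, assuming a hypothetical choice of properties (file, start line, start column): precise enough to distinguish two findings on the same line, stable as long as those properties don't move.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// locationFingerprint hashes a few location sub-properties into a stable
// fingerprint (illustrative property choice, not the actual analyzer logic).
func locationFingerprint(file string, startLine, startColumn int) string {
	raw := fmt.Sprintf("%s:%d:%d", file, startLine, startColumn)
	sum := sha1.Sum([]byte(raw))
	return hex.EncodeToString(sum[:])
}

func main() {
	// Same file and line, different column: two distinct fingerprints.
	fmt.Println(locationFingerprint("app/main.go", 10, 4))
	fmt.Println(locationFingerprint("app/main.go", 10, 27))
}
```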
D
Third-party integrations — which tells me that we're treating this JSON format as a public API contract, yeah. Why are we bringing that into scope, rather than treating this as an internal data structure that we can then map to whatever we're going to propose third-party integrators use, and decouple these constructs, decouple these two issues?
B
Well, I'm trying to explain what the impact would be of exposing these specific properties to third parties, and this is the reason why we should not go that route: we don't want to expose that logic. Because if one day we say that for container scanning we don't use this way to generate the location fingerprint, but instead we mix other properties, we will have to tell all the integrators and ask them to change the way they're authoring their reports. Whereas if we just keep that internal, we just say: hey, here are the properties that you absolutely need to fill in the report. Then, if we change the logic behind that, we can do it internally. And this is why, to me, the new `id` approach is better: we don't need to expose that complex identification logic, and we use this as the identifier for the remediations proposed in the report.
A
Lost
you
but
b-but,
it
would
change
like
if
e.
If
the
my
number
changes
and
we're
just
hashing,
the
entire
object,
I
think
that's
fine.
It
will
change
the
identifier
because
the
data
has
changed.
So
we
just
say
the
entire
location
matters
keep
it
consistent
and
it
will
keep
the
identifiers
consistent,
yeah.
C
Just pushing on my understanding: if we actually just used an enumeration — just one, two, three, four — and put a number, not a UUID, not unique across the board, not a hash, and did that in the report, and then in remediation we actually said pipeline ID or report ID plus that ID, which is basically just an index — you know, hashed into the report or persisted, if you will, with that salt — would that be a solution? Just trying to make sure I understand.
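A sketch of the enumeration idea in this question, with hypothetical names: the report only carries a small per-report index, and the backend salts it with the pipeline ID when it needs a globally unique key.

```go
package main

import "fmt"

// occurrenceKey namespaces a small per-report index with the pipeline ID so
// the combination is unique across pipelines (names are hypothetical).
func occurrenceKey(pipelineID int64, reportIndex int) string {
	return fmt.Sprintf("pipeline-%d/finding-%d", pipelineID, reportIndex)
}

func main() {
	for i := 0; i < 3; i++ {
		fmt.Println(occurrenceKey(123456, i))
	}
}
```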
C
The problem is — yeah, very much so, and sorry — it's very much an occurrence ID, so like: we found this, this one time; it points to this very moment in time. Yep, that makes sense. And it's not "this kind of vulnerability"; it's this specific instance in the code, so really don't use this for grouping, because it'll only ever map one-to-one. Yep, okay.
B
That makes sense. And the other reason we don't want to expose that is because this is not something common. A lot of tooling out there doesn't do this complex identification logic, and having the algorithm recognize a match between executions is something really important for GitLab, because we are building a ton of features on top of this reporting. It's not just a one-shot report that tells you which findings were found in your project; it's that we are constantly watching the status of your project, we are capable of maintaining attached metadata like feedback, and we are able to aggregate multiple different reports into the same view, etc. So all of these features are built on top of simple reporting, and third-party integrators don't have this logic. We don't want to expose it to them and force them to follow it, because we might change it in the future, and this is why we're not putting all this primary identifier and location fingerprint logic into the report and into this identifier.
C
And
that
that
was
surfaced
yesterday,
when
I
was
talking
to
Lucas
about
the
rails,
app
actually
has
all
the
information
to
be
able
to
group
appropriately.
It
can
change
its
algorithm,
etc.
Us
trying
to
impose
that
probably
the
wrong
place
to
do
that
so
I.
That
makes
sense
just
to
make
sure
I
understand.
With
regards
to
the
the
use
case
for
Auto
remediation,
do
you
like,
in
the
case
of
UT,
ID
or
UID,
or
like
some
randomly
generated
thing,
and
when
is
it
gymnasium
that
does
this?
That
can
that
can
auto
remediate?
E
Gemnasium finds the vulnerabilities — I mean, in the JSON report — and we reuse these results, in a way, to perform remediation: findings versus remediation objects, I mean, yeah.
E
So to speak — and we convert them to JSON remediations using the common library, and right now it works because the CVE ID, which is what we use to reference vulnerabilities from remediations, is predictable. So you can, you know, hash it and the CVE works, even though remediation works on vulnerabilities that are not the ones in the reports (which could have unique IDs), but the ones in the internal data model Gemnasium uses.
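For illustration, a hedged Go sketch of how a remediation object could reference the findings it fixes — via the legacy `cve` today, via the proposed `id` tomorrow. The structs are illustrative, not the actual report schema:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// FixReference points a remediation at the finding it fixes: via the legacy
// `cve` string today, via the proposed `id` property tomorrow.
type FixReference struct {
	CVE string `json:"cve,omitempty"`
	ID  string `json:"id,omitempty"`
}

// Remediation is an illustrative stand-in for a remediation entry in a report.
type Remediation struct {
	Summary string         `json:"summary"`
	Fixes   []FixReference `json:"fixes"`
	Diff    string         `json:"diff"`
}

func main() {
	r := Remediation{
		Summary: "Upgrade vulnerable dependency",
		Fixes:   []FixReference{{ID: "finding-0001"}},
		Diff:    "base64-encoded-patch",
	}
	out, _ := json.MarshalIndent(r, "", "  ")
	fmt.Println(string(out))
}
```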
E
But
it's
fine,
because
it's
all
we
can
generate
the
CVID
as
long
as
we've
got
all
the
fields,
and
so
this
is
why
anyways
well,
you
can
have
a
look
at
my
notes
with
the
comments
I
share
with
you
it
so
in
there
it's
pretty
similar
in
the
case
of
cloud.
We
would.
We
could
obtain
the
card
in
such
a
way
that
it
works.
Even
if
dumb
the
ideas
are
random,
that's
possible,
but
we
have
to
update
the
card
and
I
know
I,
don't
know
if
you're
following
it
is
because
my
connection
is
not
stable.
C
Yeah
I'm,
definitely
following
it
I
think
the
only
thing
I
see
is
this
use
case,
where
maybe
it's
ran
at
a
different
time
and
they
were
random
being
able
to
resolve
that
lookup,
like
you
said,
we'd
have
to
write
code
for
that.
If
we
did
it
from
code
and
it
ran
in
multiple
pipelines
and
it
would
no
longer
be
unique
across
the
analyzer,
so
they're
kind
of
in
conflict
at
that
level.
B
And this is why we shouldn't change that, and that's what I was pointing out. We need to generate the ID when first parsing the report, and the findings each get their own unique ID. Then, when we do the remediation, we need a different lookup than regenerating the same ID and finding the match, because that's no longer possible. Instead we will leverage basically the same logic, which is: how do you recognize which vulnerability you're currently trying to fix, by looking at the primary identifier and the location.
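A minimal sketch of that fallback lookup, with hypothetical types: when the original random `id` can't be regenerated, the remediation is matched to a finding through the primary identifier plus the location fingerprint.

```go
package main

import "fmt"

// finding is an illustrative stored finding with the properties a later
// remediation step can still observe.
type finding struct {
	ID                  string // random id assigned when the report was parsed
	PrimaryIdentifier   string
	LocationFingerprint string
}

// matchFinding locates the finding a fix targets without regenerating its id.
func matchFinding(findings []finding, primaryID, locFP string) (finding, bool) {
	for _, f := range findings {
		if f.PrimaryIdentifier == primaryID && f.LocationFingerprint == locFP {
			return f, true
		}
	}
	return finding{}, false
}

func main() {
	known := []finding{{ID: "a1b2", PrimaryIdentifier: "CVE-2020-0001", LocationFingerprint: "loc-1"}}
	if f, ok := matchFinding(known, "CVE-2020-0001", "loc-1"); ok {
		fmt.Println("remediation applies to finding", f.ID)
	}
}
```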
B
They just need to make sure that the remediations they are putting in the report are provided with an ID that matches the findings they will fix, and it's their own job to figure out how to do that. The reason for using a specific `id` property is that we don't want to force them to do this in any specific way. They do their own business and we do our own. We can provide them an explanation of how we are doing it, but we don't want to enforce it. Okay.
C
So
the
this
very
much
so
needs
to
point
to
a
specific
occurrence,
because
it's
saying
this
occurrence
is
solved
and
then
the
lookup
would
be
the
same
way
that
we
would
group,
vulnerabilities
or
whatnot
saying
this
is
the
same
kind
of
vulnerability
and
we've
solved
that
kind.
So
anything
that
matches
this
grouping
of
the
thing
it
solves
those
and
then
it
points
out
specifically
this
unique
identifier
across
all
runs
saying:
we've
solved
this
and
this
and
this
and
this
and
this
all
of
these
occurrences
right
is
that
yeah.
B
Because,
currently,
we
consider
that
the
remediation
is
attached
to
the
same
pipeline.
That
is
a
finding
that
has
been
reported,
and
this
is
miss
make
sense,
because,
where
you're
fixing
things
is
often
tied
to
the
specific
location
at
this
point
in
time,
so
I
touch.
This
is
current
state
of
the
other
projects
or
just
go
to
the
line
number
and
thing.
E
The good thing is that if you find a remediation for one and you have a duplicate, then the remediation applies to both, because it's the exact same vulnerability based on the fields we've got. So it's not an issue. But then, if you want to debug — and debugging is never easy — it's not great, because you may want to refer to this specific finding, the one generated by this particular scanning job, and you can't talk about duplicates in a meaningful way when they have the same ID.
A
Maybe. So it sounds like what would be nice would be a semi-random identifier, namely one that seems random, like a timestamp, but is stable, so it prevents collisions. It's not something to depend on, but at least that way, for debugging etc., we can — I don't know — use some kind of mock or something to make that easier. That makes sense.
A
That's
what
you're
talking
about
yesterday
is
like
I
I'm
wondering
how
annoying
this
is
going
to
make
our
test
projects
and
updates
to
those
I
guess.
If
the
data
were
never
to
change,
then
catching
the
entire
thing
would
be
fine.
If
it's
random,
then
we're
going
to
have
to
update
our
tests
with
every
run
unless
we
change
our
comparison
logic
again,
unless.
E
You
know
unique
in
that
context,
but
it's
it's
just
easier
in
the
ideas,
unique
in
the
context
back
line
again,
keep
in
mind
that
I'd
like
to
be
able,
especially
for
the
Canon
scanning,
because
it's
expensive
to
have
a
remediation
job
out
there
in
scanning
itself.
Out
of
a
scanning
job
so
that
people
can
skip
it,
users
can
skip
it,
I'll
tweak
it
set
a
different
time
out,
but
not,
and
if
you
had
yeah.
C
The thing to expose there is: here's the — not the ID, but the thing that you can use to group what all of this solves, and then Rails or wherever else can look up all the occurrences it solves specifically, right? Sorry, could you repeat that? So if auto-remediation happens at a different point, we no longer know what the random IDs were that were created for those things, right? And so they'd be different, probably even in the same pipeline — okay, so maybe it could pass it forward or something, yeah.
E
Actually, it's possible to add a few lines of code to the CI configuration file to do that, and the remediation job would process this artifact, try to fix vulnerabilities, and then create a remediation artifact using the ID of what has been found and, let's say, what has been fixed. So that works; there's nothing special involved to do that.
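A sketch of that separate remediation job, under the assumption (hypothetical artifact name and structure) that it simply reads the scan report artifact and reuses the `id` values the scanner already assigned:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// report mirrors just the fields this sketch needs from the scan artifact.
type report struct {
	Vulnerabilities []struct {
		ID   string `json:"id"`
		Name string `json:"name"`
	} `json:"vulnerabilities"`
}

func main() {
	// Hypothetical artifact name passed forward from the scanning job.
	data, err := os.ReadFile("gl-dependency-scanning-report.json")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	var r report
	if err := json.Unmarshal(data, &r); err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	// The remediation job reuses the ids already assigned by the scanner.
	for _, v := range r.Vulnerabilities {
		fmt.Printf("attempting remediation for %s (%s)\n", v.ID, v.Name)
	}
}
```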
C
Good. So, a scenario here — I'm curious if this would be an okay constraint. Lucas, myself and you all open an MR toward a repo, and we're all building off of the same master, so we all have the same vulnerability, or we'd all find the same occurrence, so to speak. All of our pipelines would find it and generate some random unique ID, you know, unique across all runs. It could pass it on to the next job.