A: So, I will start recording. Hello everyone; let's see who we have here. I think everyone has the correct Zoom link. This is our brown bag session, an informal technical meeting about security report parsing and ingestion. If you're expecting this to be highly technical, it probably won't be; I'll just try to go through report parsing and ingestion at a really high, abstract level.
A: Let me introduce myself. My name is [inaudible]; people usually pronounce it wrong. I'm working as a software developer on the Threat Insights team. The agenda: first I'll show you the different channels through which our users can see the vulnerabilities that appear in their code base, and then I'll talk about the previous versions of report parsing, the pipeline security tab, and report ingestion.
A: There was no "report ingestion" before; it's a term we recently coined. We previously had a store report service. I'll also talk about the current versions of report parsing, the pipeline security tab, and report ingestion, and then the future versions of all three. At the end, I'll try to answer your questions, if you have any.
A: The different channels we have within the system are: the pipeline security tab, which lists all the vulnerabilities found in the code base for that particular pipeline run; and the vulnerability report.
A: We have a vulnerability report at the project, group, and instance levels, and then the MR widget, which shows which vulnerabilities are new and which ones have already been fixed. The previous version of report parsing was basically "just parse the JSON if you can". We didn't have any schema validation; we just tried to parse the JSON, create plain old Ruby objects, and let the rest of the application handle any problems with the data.
A: In the previous version of the pipeline security tab, we were downloading and parsing all the security report artifacts for each HTTP request, which mostly timed out due to the size of the artifacts.
A: Imagine you're loading the first page of the security report in the pipeline security tab, and the security analyzers uploaded 10 gigabytes of reports. We parse all of it and create a Ruby hash, which is then serialized to JSON, just for the first few findings to be viewed by the user. When you try to load the second page, we do the exact same thing: download all that data, parse it again, and so on.
A: So of course Thiago was sitting in front of his adjustable desk, trying to understand why it was loading so slowly. Of course it was slow: the application was trying to download the universe and then parse the universe. That was the previous version of the pipeline security tab. As for the previous version of report ingestion: we had one Ruby class called StoreReportService. It contained 499 lines of code, and everything in it was highly coupled.
A
It
was
hard
to
make
changes,
because
when
you
touch
one
place,
it
was
breaking
another
place
which
is
completely
unrelated
and
also
it
was
really
hard
to
test
because,
like
everything
was
coupled
writing
a
unit
test
in
isolation
was
impossible.
So
you,
your
test,
basically
had
to
run
all
the
logic
in
in
that
class
to
be
able
to
test,
maybe
a
small
edition.
A: It was also creating the vulnerability records one by one and was therefore taking a long time to complete. People usually talk about the N+1 query problem while reading data, but this was a similarly important query issue on the write side. Now, the current version.
A: First of all, I want to fix the terminology, because I was using it wrong: I was saying "soft validation" and "hard validation". We came to the conclusion that the terminology is: validation means validating the given security report and nothing more, and enforcement means skipping ingestion of an invalid security report. From now on, we will use these terms.
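The validation/enforcement split described here could be sketched roughly like this. This is illustrative Ruby only; the required keys and method names are assumptions, not GitLab's actual schema-validation code.

```ruby
require 'json'

# Illustrative only: the required keys and method names here are
# assumptions, not GitLab's actual schema-validation code.
REQUIRED_KEYS = %w[version vulnerabilities].freeze

# Validation: check the given report and collect problems, nothing more.
def validate(report_json)
  report = JSON.parse(report_json)
  (REQUIRED_KEYS - report.keys).map { |key| "missing required key: #{key}" }
rescue JSON::ParserError => e
  ["parsing error: #{e.message}"]
end

# Enforcement: skip ingestion entirely when the report is invalid;
# without enforcement, the same problems are surfaced only as warnings.
def ingest(report_json, enforce:)
  errors = validate(report_json)
  return { status: :skipped, errors: errors } if enforce && errors.any?

  { status: :ingested, warnings: errors }
end
```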
A: This is how it looks in the pipeline security tab: a yellow background with a warning sign, and you can see what the warning is. That is the current version without enforcement. With enforcement, if the report doesn't comply with the schema it must conform to, we just discard it; we don't even try to ingest it.
A: If you see a user who wants to experiment with the enforcement feature, you can suggest that they set this CI variable in their .gitlab-ci.yml file.
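A hedged example of what that could look like in `.gitlab-ci.yml`; the variable name follows the talk, so check the current GitLab documentation for the exact spelling and scope:

```yaml
# Enables schema validation enforcement for security report artifacts.
variables:
  VALIDATE_SCHEMA: 'true'
```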
A: This is how it looks: under the variables section, we have VALIDATE_SCHEMA set to true, which means it will validate and enforce. And this is how it looks in the pipeline security tab when enforcement is enabled: those messages are no longer warnings but actual errors, and we won't even try to ingest the vulnerabilities contained in that scan. We can also have warnings and error messages showing up at the same time, because we also show warning messages like "this report version has been deprecated and will not work with the next major version of GitLab" to inform users about an upcoming breaking change.
A: The current version of the pipeline security tab has two modes. The first is the old one: try to download everything, parse everything, and prepare the response. The other uses the security_findings table to download just a subset of those artifacts. In security_findings, we store only the metadata of the findings, such as severity, confidence, report type, and so on.
A: This works better, so this time Thiago is sitting in front of his adjustable desk trying to understand why it's a bit faster now; that's the reason it's a bit faster. As for the current version of report ingestion: invalid reports are discarded if enforcement is enabled, which means that if there's an error for the security scan, we don't even try to ingest its reports, and vulnerabilities are created in batches of 50 records at a time.
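The batching idea might be sketched like this. It is a minimal, self-contained Ruby sketch; `bulk_insert` stands in for a single multi-row write such as ActiveRecord's `insert_all` and is an assumption, not GitLab's API.

```ruby
# Instead of one INSERT per vulnerability, records are grouped into
# slices of 50 and each slice is written with one bulk statement.
BATCH_SIZE = 50

# `bulk_insert` is a stand-in for a single multi-row INSERT
# (e.g. ActiveRecord's `insert_all`); here it is just a block so
# the sketch stays self-contained.
def ingest_in_batches(findings, &bulk_insert)
  findings.each_slice(BATCH_SIZE).map { |batch| bulk_insert.call(batch) }
end
```

With 120 findings this issues three bulk writes (50, 50, and 20 records) instead of 120 single-row inserts.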
A: So we're not doing this one by one anymore. There's a separate task class for each individual entity; if you're curious, we can also go into the code base and check what those tasks are.
A: Tests are separate for each task, each with its own spec file. So if you want to test something you recently introduced in the ingestion logic, you don't have to run the whole ingestion; you can just test what you implemented, and the tests run faster because the whole logic doesn't run over and over again. The ingestion logic itself now also runs faster, with less resource consumption; I'll show some metrics afterwards.
A: We haven't seen any unexpected errors since we enabled it, literally zero errors in the ingestion logic, which actually helped us a lot in improving our error budget; it's almost green, at 99.94%, I think. There is a downside to this approach, though: any ActiveRecord validation error...
D: Hey, sorry, quick question on that. Since we are showing ingestion errors, does that only apply to schema validation, or, to the point above, if there's a problem during the model ingest, say an ActiveRecord problem, would we surface that as well right now?
A: Okay, perfect. The tag of the error message will be different. For example, if it's a schema validation error, the tag will be "schema". If it's an ingestion error, then we say "ingestion". And if we have a parsing error, for example the scanner provided a file that isn't even valid JSON, we say "parsing". So we have pretty granular error messages there.

D: Fantastic.

A: Okay, so this...
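The granular error tagging just described might look like this minimal sketch; `SchemaError` and `error_tag` are illustrative names, not GitLab's actual code.

```ruby
require 'json'

# Stand-in error class for reports that fail schema validation.
SchemaError = Class.new(StandardError)

# Map each failure mode to its own tag so the UI can tell them apart.
def error_tag(error)
  case error
  when JSON::ParserError then :parsing   # scanner produced invalid JSON
  when SchemaError       then :schema    # report does not match the schema
  else                        :ingestion # failure while writing the records
  end
end
```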
C: Yeah, sorry, in terms of the whole batch: for the current validation error, is it the whole batch per table that gets discarded for every single failure, or the entirety, identifiers and all?
A: The entire batch, because we don't really want to leave the database in an inconsistent state. For example, imagine we created the vulnerability finding but not the finding links: we just roll back the transaction, and the database stays in a consistent state.
A: That's a really nice question. And this chart shows the 75th and 95th percentiles of the report parsing time, sorry, the report ingestion time, and I think it's not that hard to see the place where we enabled the feature flag.
A: It's now faster, and for the 75th percentile duration there's a consolidation, so it's more predictable in terms of the time it takes. As a bonus, we did the same thing for the store scans service, exactly the same approach, creating records in batches. The ones who play the stock market will understand.
A: This chart had a head-and-shoulders pattern, and after the head and shoulders it started going down a lot. I also prepared some flow charts. There's probably no point in me going through these charts now, but you can check them later on your own, or maybe we can even put them into our documentation page for other people as well.
A: I'll just tell you how to read them. You'll see there are some sub-procedures; for example, in the middle of the flow there's one shown in a circle, and there's a second sub-procedure. Those procedures are described in a different flow chart, like this one. As you can see, I also show the previous chart and the next one, just so you don't lose the context.
A: I think you're fine, yeah. Perfect. Also, yes, these are the actual ingestion tasks we have. You may wonder why they're colored: because I wanted to highlight them, and also because, after preparing the presentation, I realized it's possible to give them a background color, so I thought they'd look fancier.
A: So maybe I can read through the tasks, so you'll have an understanding of what tasks we have. The first thing is that we ingest identifiers, because the identifiers are shared between different findings. Then we ingest the findings, because to be able to ingest the vulnerabilities we first need to ingest the findings; if that fails, there's no point in creating the vulnerabilities. Then we attach the findings to the vulnerabilities, basically updating the vulnerability_id attribute of the finding.
A: With this design, whenever a scanner group needs to add a new task to this pipeline, they can introduce a completely new task. They can own the feature, they can run it against the code base, and they can test it easily without needing Threat Insights to check the logic. Of course, we're here to help.
A: If you need help with how to design a task or how to implement one, we'll be more than happy to help, but it's far easier than trying to place your single function within a huge service class.
A: You can just implement a new task. And this, Lucas, is what you already asked about: if there's an error in this pipeline, we roll back the transaction here and also save the error on the security scan, so it will be visible to the user.
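The task pipeline and rollback behaviour described here could be sketched roughly as follows. All class and method names are illustrative, not GitLab's actual ones; the failure flag is only there to make the rollback path demonstrable.

```ruby
# One task class per entity, executed in order inside what would be a
# single database transaction in the real service.
class IngestIdentifiers
  def self.execute(context)
    context[:log] << :identifiers
  end
end

class IngestFindings
  def self.execute(context)
    # Hypothetical failure hook, just to exercise the error path below.
    raise 'finding failed' if context[:fail_findings]

    context[:log] << :findings
  end
end

class AttachFindingsToVulnerabilities
  def self.execute(context)
    context[:log] << :attach
  end
end

TASKS = [IngestIdentifiers, IngestFindings, AttachFindingsToVulnerabilities].freeze

def ingest_scan(scan, context)
  # The real service wraps this in a transaction, so a failure rolls
  # back everything written by earlier tasks.
  TASKS.each { |task| task.execute(context) }
  :ingested
rescue => e
  scan[:error] = e.message # recorded on the security scan, visible to the user
  :failed
end
```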
A: So, the future versions. For the future version of report parsing, as I mentioned, validation will be mandatory and a permanent feature by 15.0; we're almost there. For the future version of the pipeline security tab: I mean, the previous version wasn't scaling.
A: The current version doesn't scale well either, so we're trying to find a way to make it more scalable. We basically can't store all the security findings in PostgreSQL, because PostgreSQL is not horizontally scalable, and we have gigabytes, maybe terabytes, of data when it comes to vulnerabilities.
A: Imagine scanners generating thousands of vulnerabilities for each pipeline run. This is why we're looking at another database engine, maybe Elasticsearch, because we can scale it horizontally: with five or ten shards, each shard holds just a partition of the data, which will make it easier. Recently I also started thinking about using Crystal DB as a manager for different data stores, but it's just a rough idea, and I didn't even put it into the presentation.
A: For the future version of report ingestion: as I mentioned, an ActiveRecord validation error currently discards the whole batch. I think discarding just that single record might be a better alternative, so we will try to achieve that. We're also working on UX improvements to give users an indication on the vulnerability report page that there was an error while ingesting the report.
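The proposed per-record alternative could be sketched like this. It is hypothetical; the validity check is a stand-in predicate, not the real ActiveRecord validation.

```ruby
# Instead of rolling back the whole batch when one record is invalid,
# partition the batch and ingest only the valid records, discarding
# the rest individually. The validity predicate here is a stand-in.
def partition_batch(records)
  records.partition { |record| record[:name] && record[:severity] }
end
```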
A: We already have this in the pipeline security tab, but we have an issue open to show a warning message on the vulnerability report page that links back to the pipeline security tab, so users can easily check what the error was. So, yeah, as I said, it wasn't that technical. If you have any questions, I'm happy to answer, or if you want me to show you where this new ingestion logic lives in the code base, I can also open the code base and show it to you.
C: Since we have a couple of questions already that are more high level, maybe we'll just do those, but it might ultimately be worth digging into the code if we have time at the end, because Ross and I are actually working on something right now that is in the code.
C: I guess first, I was just curious. Thanks for that; it was really helpful to get the overview. Do we have any integration tests? Sorry, that's the part where Ross is mentioned in that paragraph, so number two is part of number one in the talk, but Ross and I have been working on this kind of multi-step flow.
C: So, do we have an integration test around the whole pipeline? We're trying to test something that involves modifying security findings prior to ingestion, so it's a fairly integrated task, and we're trying to avoid writing what I'd call an old-school test, like a very large, integrated store security report service spec. I'm curious whether we have anything that ensures it's all working as expected, from report through ingestion.
B: I don't know how much more knowledge I have on that, but I know that Harsha has been looking at the integration tests and trying to find the missing places there. He put out an issue; let me find it.
A: But my understanding is that instead of an integration test, you're really asking whether we have a unit test that crosses the borders of different units, right? Yes, something like: I created the pipeline and I see the result on the vulnerability report page, but more like the whole coverage for the ingestion logic.
C: Yeah, I guess I wouldn't call it an end-to-end test. Here, I can just link to what I've been using to test this feature, which is basically just this. Yeah, it starts at that comment in the store security reports worker.
C: Yes, I'll create a follow-up issue and then we can just continue the conversation there.

A: Yeah, that'd be awesome. Thanks, thank you.
C: Yeah, sure. I'm kind of curious: in doing this rewrite, I know there's a bunch of planned work around data model cleanups, things like using the vulnerability_reads table more and removing unused columns, but I guess I'm curious about your high-level takeaways on how we can better model our domain objects.
A: I mean, we could maybe go with a star model, or try to, how do you say, normalize the data as much as possible, but then the query impact would be painful. This is why we introduced vulnerability_reads: to segregate the read queries from the writes.
A: That's a really huge topic. I mean, we definitely need to discuss this a lot; maybe we can even have a spike and then try to find what needs to be done.
A: But since you're asking this question, you've probably also seen some opportunities, some room for improvement. If you have, please don't hesitate to share them.
C: I guess the only other item here would be a deeper use case for this, which would involve just talking about the problem that Ross and I have been working on solving around the ingestion service. But because that's a very specific use case for a deep dive here, I would want to make sure that anyone else who has questions or topics they'd like to cover gets to go first. I don't know, was this set for half an hour? Are we over time?
C: Okay, cool. Well, if no one else has a topic: the issue that I linked to at the top there, around a multi-step workflow.
C: There's a couple of different ways we've been brainstorming how to do that. Ross and I have both been spiking on two very similar but different approaches, but it basically involves iterating over identifiers and reordering them, so that a lower identifier becomes the primary, and recalculating the UUIDs for a lookup for us. I don't know if you want to talk about the approach you have in place there, but that's the basic idea.
C: Part of this does touch on needing something like an end-to-end test here, because it involves a fairly far-reaching component beyond a specific task. We essentially need to update the primary identifier ID on the vulnerability itself, or the finding itself, along with re-sorting the identifiers, to generate the UUIDs correctly.
A: Or imagine we have multiple jobs running for different pipelines on the same project, all trying to, you know, readjust the data.
A: I think that's the right place to discuss this. What do you think?
C: Sure, I'm happy to bring it there; that's where we brought it before, when we initially talked about the solution. When is that, though?
A
Is
it's
on
tuesday,
probably
it
the
time
works
for
you.
B
It's
yes!
Next
tuesday,
on
the
29th,
it's
9
a.m:
central
yeah,
so
you're
in
pacific
right,
lucas,
yep
yeah!
So
I
mean
we
can
we
could
I
mean
we
could
probably
bump
it
back
an
hour
so
that
you're
not
having
to
get
up
and
on
it.
C
If
there's
a
like,
if,
if
synchronous
works
for
y'all,
I
can
wake
up
early,
but
we
we
just
linked
the
on
number
three
here,
the
two,
mrs,
so
we're
kind
of
like
looking
at
different
approaches.
C
So
maybe,
if
there's
a
way
of
doing
this
asynchronously,
that
would
be
great
just
like
if
that,
if,
if
you
were
going
to
discuss
this
during
your
refinement,
that
would
be
great.
Just
like
any
comments
you
want
to
leave
on
the
mrs
direction,
we're
just
kind
of
looking
for
a
general
direction
check
on
whether
this
this
makes
sense
or
if
there's
a
better
way
to
plug
in.
We
can
figure
out
the
actual
logic
separately,
but
more
high
level
architecture,
wise.
A: We can definitely add this to our agenda for the next meeting and discuss it there, yeah.
B: Hey, we do have the recording from a couple of weeks ago, from when we talked to them the first time on this as well.
A: Seems like there's nothing else, so thanks a lot for your time and for joining this meeting. Am I still recording? Yeah, because I stopped presenting, not the recording.