Description
Secure group discussion of this issue: https://gitlab.com/gitlab-org/gitlab-ee/issues/10479
A
Quick intro, since we have new faces: we want to talk about feeding the build environment into the analyzers, especially when it comes to building. The general problem we have is that with so many different environments, it's hard for us to support all the edge cases in the world, like all the different versions of Python or Java or whatever, and the only way to achieve that is a two-phase process.
A
So that's why we stopped to discuss this kind of thing. The first note that I added to the doc is that we have a customer in exactly that situation: they have some Python project, and dependency scanning is not working because we are missing some operating system dependencies. So it's the same story over and over again: we need to provide a way for users to bring their own environment, meaning their own version of Python, for example, and their own system dependencies.
B
Yeah, one thing that we talked about (I don't know if you've said this before) is that we could always work cross-functionally, and if it's so important to us, we could contribute engineering resources to that effort. But I think one of the big issues is that it requires so much domain knowledge that I don't know how much we can do to work on that portion of the pipeline to try and get that feature done earlier.
A
Yeah, that's a good point. If we want to find a workaround, we need to find one that we can implement before the actual availability of workspaces. Otherwise it's not worth it: if we start to develop something that would just end up duplicating something like the workspaces, in the end it's not very useful.
A
So, to give you a quick idea of what workspaces are: it's a way of stacking onto the same environment again and again, which is pretty much what we want to do. You get, for example, a Python environment, you install all your dependencies there, and you can use that as a kind of base image; every time you run the job, it will use this very same environment, maybe adding some more layers to it, like you would do with Docker.
A
It's going to be shared between the jobs, for example, so it sits right in the middle between the artifacts, the cache, and the Docker images. The remaining problem is that even with workspaces, we need a way to inject, for example, the dependency scanning analyzer into the workspace. So my idea, and that's what I want to explore with you today, is that maybe we could have something in the middle.
A
The problem we have is that we're trying to either start from a base image that is not ours, or use some base image that is ours but restore the user's environment into it. So why don't we have something like this: in dependency scanning, the base image would be provided by the user, and in the script or the before_script we would do the installation of the analyzer that we want to use.
A
It should be generic, but it could also be specific, like "I want the Python analyzer on this side" or "I want the dependency scanning analyzer on the other". But by doing that, we should not rely on any of the system dependencies; I think that's the case today, so it should work. It should not rely on a specific version of Python or Java or whatever.
C
Yes, similar to what Fabian was suggesting with the approach where we were just executing the analyzer within the context of the build job: defining some steps, some script actions, i.e. what's the process to set up the analyzer environment. So yeah, it can work, and it's probably easier than trying to mix two different base images.
C
You don't need to have access to a binary... sorry, you need just the binary of the analyzer, okay! So it's a pretty minimal set of steps: the script would download the binary of the analyzer and execute it. Yes, that's it, and then we were thinking of something as simple as just a curl piped to bash or some other shell, to make it a one-liner.
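The one-liner pattern described here might look like the following sketch. The installer URL and file paths are assumptions for illustration, not a real GitLab endpoint; a local stand-in script takes the place of the remote installer so the pattern can be exercised without network access:

```shell
# Hypothetical pattern: in a real job this would be something like
#   curl -sSL https://example.com/install-analyzer.sh | sh
# Here a local script stands in for the downloaded installer.
cat > /tmp/install-analyzer.sh <<'EOF'
#!/bin/sh
# Installer: drop a stub "analyzer" binary into /tmp/bin.
mkdir -p /tmp/bin
printf '#!/bin/sh\necho analyzer-ok\n' > /tmp/bin/analyzer
chmod +x /tmp/bin/analyzer
EOF
sh /tmp/install-analyzer.sh   # stand-in for: curl ... | sh
/tmp/bin/analyzer             # runs the freshly installed analyzer
```

Because the installer runs inside whatever image the user supplied, it inherits the user's OS, interpreter, and system dependencies, which is the whole point of the approach.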
D
For instance, I remember we had lots of problems with Python projects: different users of Python projects, and different builds in the CI pipeline. Are the build steps that are required to actually build the application and to prepare it for SAST scanning different, or is it actually the same build step that the user usually executes before building and releasing their application? The basis of my question is that, for me, it looks natural that all of the dependencies should already have been installed before our analyzers scan.
C
Now we are talking about adding the dependencies of the analyzer itself. These build steps are for installing the environment to run the analyzer; all the dependencies of the project build have already been installed in the build job and exported into this base image that we are talking about. So we take the output of the build step, put that in a base image, load that image to run our SAST job, and go from there.
C
We might need to reinstall some stuff. I mean, say we have a Python 2 project but a Python 3-only compatible analyzer: the project is built with Python 2 and everything is installed correctly there, then you have a base image built on that, and what we do is just install Python 3 and execute our analyzer in the context of Python 3. So this idea could just work, but I'm expecting some edge cases where, I don't know, maybe some of the analyzers will require runtime execution of something.
C
I don't know, crazy stuff. I'm thinking mostly about the Gemnasium Maven plugin, because for Maven analysis we have a chained Maven plugin. I don't think it's the case, but it would require a Maven (I don't know, 1, 2, 3) with which we try to rebuild the project, or at least try some Maven lifecycle commands, which might not be compatible if the project is not using that version of Maven. But I think those are very, very rare edge cases, so we might just not bother with that right now.
B
That way we don't have to worry about one scanner doing both. In a similar way, if we're doing this approach where we expose a binary, fetch it, and the user runs it in their container, we still need to publish multiple release versions; in some cases that might have to be for a different architecture, like Windows versus Linux or something, but either way we're going to have to support many different releases.
B
Yeah, I think the issue is that this currently requires the user to be producing a Docker container for their build. Maybe you just left it that way for brevity's sake, but it requires the image to be a user-provided image. Normally, what we want with workspaces is to use a previous job's definition instead.
D
Okay, I'm not an expert in Docker, unfortunately, but is there an opportunity to share some volume between the analyzer image and the user's base image? The user's base image would mount a shared volume directory, which would actually be the working directory, and that would let us keep our current way of releasing Docker images for our analyzers.
A
We can't use that model because we have isolation between the Docker executions. The only thing that we could leverage would be the artifacts and the cache, which are actually the same thing: it's not a shared volume strictly speaking, but it's a kind of file system that you can share between the jobs. But even if we do that, the artifacts are limited, for example, to the project directory. So if you install anything at the OS level, it's not going to be part of the artifacts.
A
We could use the cache for that, but I'm afraid it's not going to work; I've seen so many edge cases where it would not, because you would basically have to override the whole root file system, and I think that's not going to work even though it's the only way to achieve it with the cache. But the idea that I want to discuss here is a bit different: it's about what happens if the user doesn't provide a base image.
A
We could use the default one that we have today, which keeps things compatible. But if you have something very specific, like "I have a Python project with psycopg2 to install, and for that I need the PostgreSQL package installed on the system", then we can tell the user: okay, you have some specific needs, so you need to provide an image; here is the way to provide your image. You provide that image, and the script would remain the same.
A
Actually, in the templates we would curl the binary, or I was also thinking of apt-get installing something. It should not be hard to provide Debian or Alpine packages that would work on many, many different distributions; it's not really that hard, I've done that in the past. So this script would, by default, install the analyzers and all the files that we need in the image, whatever the image is. And by doing that, we also open the door to, okay, maybe you just have a base image that is Python 3 specific.
A
Apt-get install what you need there. By doing so, we don't need the workspaces, we don't need to update the CI, the runner, anything in GitLab: it's pure template and job magic. Does that make sense, what I'm saying? Am I missing something obvious? It seems too good to be true, that's the problem.
A
So the idea is to do it the other way around. What we are doing currently is: we have our already-made image, we build it and we put everything that we need in there, and then we use that image with the project of the user. So we kind of import the project into that image and we run the analyzer on it. That's the problem, because we use our image. In this case it would be the other way around.
A
By default we would use that same image, so that it's backward compatible. But if you are a user with, for example, a Python project that uses psycopg2, which requires the PostgreSQL client to be installed, in this case we could tell the user: you have a specific need, you need to give us a base image, and we would do the rest. So we are in the full environment of the user: we have the dependencies, the OS, the interpreter, the compiled project itself, and we are injecting the dependency scanning analyzer into it.
A
To achieve this kind of thing, we can also detect whether it's a Debian-based OS or an Alpine-based one, and run apt-get install or apk add accordingly. And it's probably going to be easier than curl, because we don't have support for binaries in releases right now, apart from the artifacts, and it's not super clean to just put the binary in the repo.
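The Debian-vs-Alpine detection mentioned here could be a small helper along these lines; the function name and the package-manager mapping are a sketch of the idea, not actual template code:

```shell
# Sketch: map a distribution ID (as found in /etc/os-release) to the
# matching package-manager install command. Names here are assumptions.
pkg_install_cmd() {
  case "$1" in
    debian|ubuntu) echo "apt-get install -y" ;;
    alpine)        echo "apk add --no-cache" ;;
    *)             return 1 ;;
  esac
}

# In a template this would be used roughly as:
#   . /etc/os-release
#   $(pkg_install_cmd "$ID") <analyzer dependencies>
pkg_install_cmd debian
pkg_install_cmd alpine
```

Unsupported distributions fall through to a non-zero return, so the template could fail fast with a clear message instead of running the wrong package manager.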
B
That's one way to achieve it, but there's also the issue of air-gapped systems, right? And so I was wondering whether, with this current approach, we would require a customer to host one of our packages elsewhere, and whether that would be possible, assuming that we can do everything in npm.
B
Mm-hmm, because there will always be that issue if customers want to avoid external access. Currently we just say you need to host our Docker image, but for the binaries we would need another solution as well, and honestly it's a bit easier for customers to retag and re-host an image on an internal registry than to host a binary.
D
Speaking about binaries, I wonder: can all of our analyzers be expressed as a single binary? For instance, our Python analyzer, isn't it a collection of binaries or CLIs? I honestly haven't dived into the repo, but I feel that we need to be sure that all of our analyzers can actually be compiled into a single binary, and I'm not sure about that.
B
So we were basically rehashing an issue that we worked on a while back, called "leveraging existing builds", where the solution we came up with was basically what we're doing today. But I did look into that briefly, and there's a tool called packager, which is basically a way of including static files in a binary; it just pulls the files in. So we could do something like that, but it's pretty messy and it doesn't quite include everything: we would basically have to include entire runtimes in a single binary.
A
Because if you do that, it's mostly for assets. When it comes to system dependencies, it's not going to work, because some of the dependencies require very specific files or very specific paths on the file system. So I would definitely push in the direction of adding Debian or Alpine packages instead of a single binary.
A
If we have a single binary, that's fine, and anyway it's something that we can iterate on, because if we have control of the script, which is the case if we add that to the templates, then we can start with the binary and use whatever fits our needs in the future. That's why I like it: it's backward compatible.
B
So that's one thing. Another action item that we can take today is to go through each of our analyzers and explore splitting out the build and scan stages. If we essentially wrap all build stages in something like a command-line flag or an env var, that will make it easier for us to transition away. So, as part of the move towards splitting, one of the actions we can take today is to start refactoring our analyzers to make the build step optional, and then eventually pull in a binary or whatever fits. Perfect.
A
Exactly, that's the point we're at, but because we are not going to create new jobs (that's the problem), we need to have something that is backward compatible. Otherwise we would have to create something for 13.0, which is the next major release, in May next year. I think that if we can make it backward compatible, we can have something before the end of the year, which will solve the problem. So: refactor the analyzers, and what was the other direction? A binary.
B
I would like to carry over one more action from, I think, last week or the week before, which is that we should look at the link that you shared at the beginning, for our customer. We should target a real-world use case, so instead of a hypothetical standalone analyzer, maybe figure out how to ship the Gemnasium Python analyzer standalone, so we have something actionable there.
D
So we need to choose the place where all of this information will be collected, like the description of an epic or the description of an issue, maybe with some tabular data: which analyzers already support detecting their needed build steps and which do not. That way all of this information will be in front of our eyes, so we can easily navigate it and easily understand the current status of this research.