GitLab Verify Group, 21 Jan 2021

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: Verify Technical Discussions 2021-01-21 - Pipeline Editor and CI/CD data storage architecture

Description

No description was provided for this meeting.
If this is YOUR meeting, an easy way to fix this is to add a description to your video, wherever mtngs.io found it (probably YouTube).

A

Hello: everyone: let's wait a minute.

A

A

So yeah, I think it's time to start so hello, everyone to verify technical discussions on january 21st.

A

We do have a few interesting points in the agenda and miguel's is the first one, but I'm still not seeing miguel on the participants list.

B

Hello, hi all sorry, I was trading desks here at home and it took a little bit more time than I thought uh so I'm here on a standing desk, um okay, so uh yeah. So the topic I had for the uh for the last time was regarding the editor, uh and what I wanted to ask team was: uh what were your thoughts on continuous linking?

B

So basically, what we're doing right now with the editor is that, as you type on the on the pipeline yamo, you will get, uh you will go and fetch resources from the backend that tell you. uh If the link is correct or not, and then you will, uh you actually will invoke the graphql api for all the jobs and all the configuration, and then we will display it in, although it will display it in another tab.

B

All this information will still be loaded, and this is the information that is used, for example, to display the uh the pipeline graph and also to tell the user if the yaml is valid. So uh I think fabio is a the question in another issue in an issue where he said. Is this something that uh will be performant? Is this something that we want to do or do we want to find another approach?

B

That's what I want to ask everybody here.

A

Is that something that you are building or something that has been already built so.

B

Yeah, it's already released and it's it's public. So it's something that will I I raised the question two weeks ago and in the meantime we released that uh yeah.

A

Can you share this.

B

A

Would you be able to share your screen and show how it works.

B

uh Oh yeah, of course, yes.

C

B

So this is the editor for the current gitlab uh hitler piano. So, as you can see here, we are loading the uh information about the.

B

If the gitlab configuration is valid and let's do some typing and then, as you see, it's going to load again and I got a job and it should complain yeah it's not valid because it doesn't have scripts and if the script is, I guess the number will also complain yeah and then the linking will not be will not bring anything and the visualization will also not not happen, and this happens, as we type so can infect.

B

So yeah, so I'm going to now remove this and what we're doing is we are?

B

Is it cached, that's cool yeah, so, as you can see, I'm making requests to the graphql api that tell us basically all the information about the pipeline. uh So I'm gonna here here are all the stages, uh of course, for a project like gitlab. It can be quite extensive and this is happening on every request. So of course, there's chance. There's opportunities for optimization here.

D

I I have like one question like: um can it happen that you fire, like 10, requests concurrently, that's going to be executed, or is it like? You wait for one request to finish before you fire another one.

B

Yeah, definitely we can. I don't think right now, it's working like this.

B

No, so I think we're triggering uh requests every time uh and then we're not waiting, but of course so on the front and in principle we we can do all these optimizations, I think from also we can delay. We can wait a bit more right.

B

We could wait two seconds or three seconds after the user stop typing to to load this, but I think this is more about the principle of the mechanism, because once we start to depend on it too much, then we will have to deal with all the performance considerations for the back ends, especially as users use this more.

B

D

To be honest with you from my perspective, it's great especially that like ben sarah can use exactly what you are receiving to build a pipeline graph and visualize that, alongside because you already have the data uh exactly basically like, like you, can have it side by side right now and like be updated, live yeah, at least from my perspective, like the only concert that I may have. It's really like you have, like probably the example that doesn't use includes. Does it use includes or not.

B

I believe yeah this. This definitely if you think, looks enough.

D

Okay, so like it uses local includes, so they are pretty fast to like resolve, but if they would be remote inclusive, they would be probably significantly slower to resource so, like the dealer may significantly longer.

D

So I I at least like from my perspective. I think you should be cautious about how many requests you fire at the given moments, because here, like uh you, are kind of working in pretty much happy scenario that, like the requests, are fast to execute and they seem to be taking like probably one and a half second, uh for this example, but they're gonna be cases that, because rico is gonna, take 20 seconds and now, if you enqueue 20 requests uh every like character person writes it's gonna be pretty severe for the performance.

D

E

Sense to cancel the old requests that we don't need.

E

Would it make sense to cancel the old requests that we don't need, instead of uh like going one by one as a queue uh like if you.

D

Send a request and it start being processed by the back end. You'll- have no ability to cancel that.

E

Well, if the connection is dropped, you can cancel like, uh um like the database query or something like that as well. Right.

D

E

D

You, like I'm, not sure if we're talking about the same thing like, but if you cancel http connection close the socket, it's still not gonna be with you by the by the transport, because it's gonna be receive the help from the socket as soon as you start, writing feathers, which happens after processing the request label and like and like rice, usually do catch the output in the memory before emitting headers and the border to the request. So yeah.

D

I I mean like it's something to check, but like I'm pretty confident that, like you, have no ability to cancel that auntie like it finishes really yeah.

E

uh I'm coming from the go word and the co-word handles cancellation with contacts and that automatically happens with http, so I'm not sure how it does it in ruby. That's what I'm asking.

D

uh I mean on the go like it's much easier like to have this cancellation down to some extent, but I guess it's not always like so direct because like if you have those balancers, it can basically catch. This tcp socket being closed and you may also not receive that.

E

D

Enough all right, the other way really, I think that we've been doing in other places, sometimes is like you, send a cancer request with the separate end point and you kind of keep track the the pure one and you really interrupt execution in between, but it's over complicated. I think.

B

Yeah, so one of the one of the things I uh I was thinking about is that in principle we could also send the changes that we do to the file somehow, but that, of course that requires a lot more technical uh involvement on everybody, because we have to um let's say I I don't know, remove this line.

A

And then I will send you. Let me introduce you because I think it's interesting, um because uh there is certainly a lot of room for improvement here, and we can presumably spend a lot of time on front at the front end to optimize that.

A

But what I'm saying very often is that it's very difficult to optimize something if you do not really understand how it performs and what your improvements are actually doing or whether whether the improvements are working or not, and because this is using graphql.

A

It's especially tricky because it like might be a challenge to understand um the amount of requests that are coming from this particular linter or pipeline editor, and then understanding the latency and the you know tail latency for like uh the slowest requests, because people can put an arbitrary uh configuration here and it can actually translate uh to a latency. That was not expected. For example, when someone is using a ton of external includes yeah.

A

So I think that it's interesting, because we are working on this graphql blueprint here now and I wonder if we could actually include something that would make it easier for engineers to understand how their code actually performs. Graphql wise, like uh you, could go somewhere to the grafana and uh enter like specify the uh the feature, for example like pipeline editor, and then you could see um you know the histogram of graphql execution, latencies and stuff like that, and you know when you have something like that.

A

You might actually then optimize it, and uh there are a ton of ways to do that. But without knowing you know whether it's it makes sense to spend time on that and optimize. That, like it feels a little bit like a premature, optimization.

F

Yeah- and this is brand new right like this- has just come out, so we don't really have any information or reports on where things are broken or anything like people have barely started using it. um So I agree but um working to get the data and then seeing what to optimize, makes a lot of sense, um and you know we'll also have sort of bigger problems with this.

F

Well, not bigger, but like there will con, the issues will continue to evolve in the case, because you know eventually we'll want to have like the visual editor and there's like a lot of stuff coming up. The pike that'll have to do with sort of editing these files somewhat differently, and so um I I think it will evolve. uh The use case will evolve as well.

A

Yeah, so it's interesting and I have a call of camille scheduled for tomorrow to talk about rockwell cummings, our architecture, evolution coach for this, and uh perhaps we will be able to find a solution that will make it easier for engineers to understand how their features that are using graphql are performing so um yeah. I'm hoping that we'll find some solution for this.

B

Cool thanks, so um so what I got there is that we first want to measure the the outcome of this, and then we release that if we need further optimization.

A

So uh yeah, so uh I think that one action point might be actually creating an issue about understanding how this particular feature performs in terms of the cost of executing all these graphql queries.

A

um It might be, you know, a good idea to have some solid data taken out from kibana, for example, and which is visualized on some kind of a graph um to understand that, um and and then you know that this might be a good enough uh evidence that something something either under performs or or should be and should be improved or like it's perhaps fine to keep it, as is for now at least.

G

A

G

Are there any worries that, as people take this, the performance is going to degrade because there's going to be more and more and more people making use of it and they'll be, like you know, an exponential curve of requests?

G

It's not like a single click like one person typing is going to generate hundreds of requests, so each person that starts using it seems like it is. You know there there's a risk there.

H

Yeah, I think I think this is also something that uh everything we're saying about. Instrumentalizing data monitoring for graphql requests is like the best first step, because then, if we have that problem, it's a good problem to have. It means people use our section, um and then we can see how we like.

H

There are multiple way we can solve this and I think just making sure first that this is a problem before we even address it is a good thing and then so that when we start improving, we also can monitor the difference of like speed and like we're, actually fixing a user problem and not just optimizing something that isn't even needed.

A

For now so I added the link to the architectural blueprint for graphql to the agenda. So if someone has some ideas, uh what we should include there over what is already there, please feel free to contribute.

D

I I think, like in any case, any stormy effect of the feature on gitlab should be avoided, so I'm kind of like emphasis with the uh command that was mentioned to consider that, like I, actually have like very cool, like simple tests to perform like if I copy paste another github crm into this, it's gonna generate as many requests as the characters being posted, or will it be more clever.

B

It will be clever if I remain correct. It's a few hundred milliseconds per per change, so this this actually depends on the web. Id editor uh on the same editor that the web id is using, which is called editor lights, and there is some debunking, but it's not super smart, it's, but it's not crazy. It's not gonna overcrowd the request, tab, let's say.

D

Okay, but, like I think like it generates like you, should have some kind of dashboard showing how many requests is being generated because, like I think it's not about, like the let's say, amount like in the window of like the 24 hours.

D

But we want whiter cases like you create a storming effect, because someone can really do a very simple daniel of service type of the attack and like we have very limited capacity of the request that we can process at the given time, and uh we are kind of very cautious about like ensuring that we don't call too many.

D

So I guess like it, attenuates like you, may consider some mitigation, just even like some kind of coding mistake that could create the storming effect in the single project.

A

Limit like one request to graphql per second and uh and enforce this limit on the front end for the time being, until we have some better monitoring tools for grappler.

E

Maybe we don't even have to limit it. Maybe we can have a save button.

E

That's what vs code did in the beginning right? They just didn't, have the auto save feature because uh their file rating wasn't up to performance, so they just have the save button and that's it. I'm.

D

I'm kind of thinking like we should not aggregate. This awesome like ux like save button, is the credit yeah. Yes, it is please.

E

But they're waiting for like waiting one second, trying to figure out why the feature like the request is not getting updated. It might be. Also worse, I mean like, like you.

D

Don't you don't have really white weight like one second, like I think, amigo said that they have the bounce feature right so like yeah, it's 250.

B

D

Like you're gonna cut a lot of peacefuls between 250 seconds, one like the very simple thing, like don't fire another request. If there is another one in fly, this could be like a very simple audition to do.

A

Okay, is there anything else you would like to discuss uh about this lean test, you type feature, or should we like move to the next item.

A

Okay, so the next next item is mine. It's about see, I build architectural architectural work that uh is getting started.

A

I created a very uh simple scaffold on the blueprint that there is no much in there yet, but I wanted to capture a few goals uh we are having before this meeting. So that's uh why I listed them in the merchant quest, but also in the agenda, so the things that we want to achieve. um We want to make them tangible and quantifiable.

A

So uh the first thing that is kind of urgent is um moving away from 32-bit integer, as a primary key for the ca builds table to a big integer that it's a basically 64, uh eight bytes large um um primary key.

A

uh As far as remember, we are over the 40 uh capacity of the 32 bit integer and perhaps around 50 right now. I I do not know exactly. I wanted to check that. It means that we have some runway, but uh the sooner we start working on that the better- and um there were a few ideas about how to approach that.

A

The first idea we had a few months ago was to actually work on the data data retention so that we can remove all data uh from the database for pipelines that we do not want to resume uh and um for four part lines where we do not really want to retry any builds. We would make them kind of degrade in the ui so that you cannot retry a pipeline because it's just too old it's one year old. So there is no point in retrying something.

A

Instead, you should just create a new pipeline in your project, so this would definitely save some uh storage, but there are some inherent risks and uh the another idea is to explore table partitioning. It's a postgresql feature that allows you to partition this table and store part of the data in one database.

A

Part of the data in another database- and I I think this is also very interesting idea, um something that the database team tries to like. You know uh popularize, because it it's a very interesting feature and it can actually reduce the amount of data we store in the primary database significantly.

A

There are some inherent risks here as well uh like whether we are performing a queueing of ci builds using postgresql and sql queries and partitioning can kind of break this workflow unless we are diligent enough to keep all the builds that can be queued on the primary partition and many different, like you know, uh concerns and the challenges here.

A

So um the part you know it's um it's architectural workflow and it's a very collaborative workflow, so cam is going to be our architecture, evolution coach for this blueprint as well, and I I hope, to collaborate with fabio who's, a domain expert and other people uh in the ci and verify area to actually, you know device a solid strategy forward for for this um architectural effort. So yeah do you have questions.

E

I have a question mostly because I'm not too familiar with the space like when you want to scale a table. When do you choose a big integer like we're doing right now or a guid, for example,.

A

uh So basically, we are relying on integers everywhere in the rails code. We we could generate some kind of resource identifier and it does not need to be an integer. But right now in the ui uh we are showing ids almost everywhere, so that lower like default uh solution for primary keys.

E

Okay, I'm not suggesting like using guid, so don't get me wrong. I'm just curious when someone chooses which direction to go to.

C

Do we have any examples already of data being partitioned on gitlab.com? Is there any other team? Any other thing done that.

A

So that's a very good question and we honestly I'm not 100 sure um but um jose fitino uh is a domain expert in in in this blueprint, I'm talking with him quite frequently and trying to source more knowledge about how partitioning works at in in postgres.

I

We we have the product analytics events tables that are partitioned, that dimitri worked on.

I

As one example, so it might be good to reach out to him.

A

Yeah, so um we we are going to have database team members involved in the blueprint as well. So this way we make sure that the partitioning strategy we choose is consistent with what we are doing in case of other tables and other initiatives at gitlab.

A

um But the challenge with the ci table ci builds table is going to still be significant because it's one of the biggest tables with the highest amount of rows in there. um So um tackling this problem of uh primary key and and row identifier. It's still going to be challenging because we cannot simply change the type uh it might be actually necessary to add a new column and then migrate the data and then, after the next release, to switch, for example, the primary key index.

A

So that's um that that might be challenging and uh a significant amount of uh domain knowledge from the postgresql areas required. So there's also an external company involved. I I think that uh I don't remember how this company is called, but we do have a consultants uh available um to talk with them about feasible strategies for things like that changing the type of primary key.

A

So these people are also going to be involved so that we, you know, uh uh are certain that what we are doing actually makes sense uh based on how postgresql works internally, because it turns out that it um I had this the question a few weeks ago.

A

Is it possible to just alter, table and change the type is that going to work, or is that going to bring the postgresql down and without knowing uh in you know, in details how postgresql works? You cannot answer this question and, of course we are not going to check it on production right, so uh so uh that significant domain knowledge is required here. So we are involving consultants and database team members with domain.

A

Expertise any questions and concerns about the data model for shared, builds.

D

Table all right, my only comment was good luck. It's gonna be pretty crazy, complex task.

A

Yeah, but uh I I think we can do it. We just need to devise a strategy for iteration and not closing doors. You know not removing data until we know that we can safely remove data and stuff like that. So that's going to be interesting, and uh so I'm quite excited about that. So if anyone would like to help, um we still do have room for domain experts and people that uh might want to work on this. So if you would like to help, please feel free to reach out to me my own slack and.

A

Okay, so the next uh item is the new grafana dashboard for verify and continuous integration. If you click on the link, uh you will see the dashboard for the continuous integration group.

A

Actually, almost every group inside verify has their own dashboard, but the continuous integration dashboard is the first one we started extending and at the bottom of the dashboard. You can see this panel panel called metrics for for build logs and that's uh what we have extended this dashboard with um everything else.

B

Records, maybe you can share your screen with the thing you're looking at yeah.

A

That's definitely a good idea, so, okay, so so that's the dashboard and everything in uh like this panel panel, this one this one, that's something uh that comes from um a generic dashboard for a stage group. uh When we go to like verify testing, I think verify testing.

A

You will basically see something very similar, but there is no this component that we recently added and- um and we added.

A

So we added the span panel for matrix for for build logs, and you can see that we have like the counter for the rate of build, looks streamed um invalid and corrupted build logs detected all my tweaks and throughput of build logs processing with bytes per second, and uh this is uh something that uh you know it's a custom extension.

A

So every team in verify, and not only verify in other stages as well, can actually go to their dashboard and extend it with all the metrics that they want, and, um in my opinion, it's a nice idea, because uh you have one place with the most interesting metrics for your group and uh extending it it's like not super complex.

A

Let me show it to you how this.

C

Looks like english is the is the common part, so some sort of partial that is kind of included in all other dashboards like from for verify, what's copy paste exactly so.

A

uh Let me show this file view file, uh history and uh like this is how this file looked like before me, contributing to it it's just the stage group dashboard for continuous integration, and there is like the trailer at the end of the dashboard, and this, like is like a function that generates your dashboard. If you want to extend it, you need to put all the panels and like definitions uh between line five and six, and it looks like exactly the same for, for example, verify testing.

A

uh So uh so I added a few panels, so you can see that we need to add panel, then specified layout. We can add some text. We can add some basic time series and uh you of course put a prompt ql in here uh as a query and graph is going to handle all the intervals environments, um and this way you can easily.

A

Add your metrics for your group into the dashboard that you most frequently is going. You are going to look like that at so uh yeah doesn't make sense. Any questions.

C

Do you know where the common template lives for the verify.

A

uh So everything is it's uh defined in this project. It's uh the runbooks project maintained by uh our infrastructure, team and sres. So um there is a common dashboard somewhere. I don't exactly remember what file defines it, but it's somewhere in here. I can't remember you would need to find it. But basically the idea is that the common dashboard is being maintained by the scalability team, and uh you know stage groups can extend it, but the common thing is it's being maintained by this call of duty.

A

They make sure that what it shows is accurate and that they are getting metrics from uh like valid sources, because under the hood in the race application we are um defining feature categories for every site, worker for controllers and stuff like that, and then it's being aggregated by the tooling written by scalability team and pulled into kibana and displayed on a stage group dashboard doesn't make sense for me.

C

Yeah yeah, I was just thinking. Let's say we want to start. uh You know, add a new chart about uh pipeline iterations, I'm just making like an assumption, something that could be um of into of interest to all the groups within verify.

C

So would that be something where, like? I would like, maybe to edit to kind of promote one chart to the verify stage rather than I think only to continuous duration.

A

So uh yeah I get what you uh so if there is a common ancestor for the stage, not only for the generic stage, that's a very good question and, to be honest I don't know so. This is something we would need to uh ask scholarly team about: okay, okay and uh yeah. I think it's uh actually a good idea and something like that might be useful.

A

However, I just suspect that something like this has not been factored into the first iteration and, of course everything is working progress, so uh with a good reasons uh provided scalability might be able to extend it and lose karaoke.

C

And how did you find the the process of updating the uh adding new dashboards in the new data dashboard.

A

In terms of how to do that, the.

C

A

uh Like the workflow, it's very simple you you clone the run books uh project you make modifications and submit that merge, request for review, and anyone in every sre can make your changes and changes and merge them. It's probably a good idea to ask bob or someone from the skull big team instead of like every sre, but because they are like building this generic dashboard stuff um yeah, and if you would like to learn more about how to actually do that technically, like uh what files to open how to push your changes.

A

There are there in the dashboards directory, there is a readmemd file and some documentation is available inside and we plan to add more metrics for the uh for the c continuous integration group. I created an issue about that today and listed at least a few that we can add. So uh I plan to add it for you, but if someone would like would like to learn how to do that feel free to take a step at one of the metrics, and you can collaborate with me to get immersed.

A

uh Yeah, so uh I think that we are almost at time um so there's the last point. uh I had about finding an expensive query, because this is something uh we we have seen recently that um database engineers, uh I identify a query that is kind of expensive or is running very frequently and contributing to a significant load on the database and asking stage groups to like improve the performance or do something so that we do not run such a query.

A

That often- and sometimes it's quite challenging, to find this query, so I created an issue that could perhaps uh help with that, um because right now we do not have a good way of finding uh where the query stems from or like which area of code is actually actually executing this query, so the infrastructure team is going to give us the query, but if it it's like very generic query, it's very difficult to pinpoint the place in the code base that is responsible for the biggest load.

A

Sometimes it might be simple fix like adding an inverse of. Sometimes it might be more difficult, but you need to know which area of code base is actually causing the law to optimize it. So there is this issue that might actually help with answering this question and uh yeah. Are there any questions? We do have like a minute more.

A

A

Oh okay, so thank you very much for attending uh the call today and um looking forward to seeing you in two weeks have a great day bye.