From YouTube: JWT auth API availability and database load balancing
Description
Related to https://gitlab.com/groups/gitlab-org/-/epics/5215.
A
Okay, so some of you were on the previous meeting related to the container registry and high availability, others not, including myself. But this one is to talk specifically about the Rails API and not really the registry. Given that the registry depends on the Rails API for authentication, though, that's also related to high availability.

Okay, so one of the things that we discussed when the concerns around the registry availability arose was the availability of the authentication API. So we created an epic to investigate how we could improve the availability of that, more specifically the JWT auth route on the Rails API. This epic has a long trail of discussions with Stan, Kamil and others as well, and we came to the conclusion that, contrary to what we were expecting, the API route doesn't rely on the database replicas.

Instead, it requires a connection to the primary database, even though the only required queries are SELECTs. This has an impact on availability: whenever there is a database incident due to a failover, the authentication API won't be able to serve any requests for as long as the primary is down.
B
So, you know, I can see there's been a lot of discussion on here about having this particular route inside the monolith have higher availability, but that's not the service level that the infrastructure team has. We have it for the monolith, and we can't sort of provide more availability for some endpoints over others; if we want to do that, it has to be outside the monolith.

I just want to raise that because it feels like it needs to be said early on, and obviously, if we're going to build this up and make it stronger, there are ways that we can do that. But the reason availability goes away is not because that endpoint fails; it's because the deployment breaks or, like we had last week, we had segfaults that take Rails down. It's not something where we can protect one single endpoint and give it higher availability.

If we want to do one endpoint, we have to do the whole application, and, frankly, if we start taking the whole application beyond 99.95, you basically have to go away from daily deploys, and that has a very big impact on our ability to iterate. So that's just from an infrastructure point of view: we can't do magic things for specific endpoints inside the monolith.
A
Yeah, but that's another point. One suggestion from Kamil was that we could discuss using a separate fleet just for these routes, so they would be served from a separate fleet and wouldn't be affected by other long-running requests causing slowdowns on the server.

So that would be an option, and the other option, or a complement, would be making these routes use only database replicas to get the data, because those are more available than the primary, which is a single server. That's basically what we have raised for discussion: one, see if we can make these routes use only the database replicas and not the primary server; and two, consider using a separate fleet to serve this route specifically. Those are the two main points.
B
Yeah, I mean, it's not just the separate fleet, right, because there's also: how do we deploy that? Do we deploy it at the same time as we deploy everything else? So, if we are setting an expectation with the customer that we're going to have, like, five nines or six nines on this, there's a bigger discussion around that, because that's very difficult technically for us to actually do.

I just want to set the expectation that it's a difficult thing, but I'm all for making it more resilient. That's just the other side of it. Yeah.
A
And, bit by bit, if we could get the API to be more resilient and more stable and always match the SLA, which is 99.95, that would be a good first step. But at least from the way I see it, as long as we depend on a single database server to serve these requests, we are always prone to database failovers, which can take a long time.
A
Yeah, so first, on the load balancer code that we have, from what I saw, we only send a few methods to a read-only connection, and for the remaining ones we always default to a read-write connection. And from invoking the routes with the primary down, we could see that there are a few methods that are required, and right now they are all going to the read-write connection by default. So one of the things is trying to identify whether any of these methods can be implemented on the load balancer to use a read-only connection, if that is possible or not. And there are other things as well: I found, for example, that some specs require the API to use a primary server, and one of them I linked here.
A
This relates to an issue from two years ago, where apparently there was some replication lag on CI builds. Basically, when there was a build, the token used for that build sometimes was not replicated in time to the replicas, which caused the pipelines to fail. I also found an MR from Kamil because of this, which made this authentication route, when using the GitLab CI token, stick to the primary database server instead of going to a replica when that would be possible.

But at the same time, I also found that this relates to a previous issue due to replication lag, and apparently that lag was resolved before those MRs that changed the routes: it was related to SSL compression, so apparently disabling that fixed the replication lag, even though this MR and another one were merged afterwards, possibly to prevent this from happening again.
C
So one thing to consider with the replication lag: is it really safe to perform, let's say, GitLab CI token authentication on the replica? There is a case where this token may already have had its permissions revoked at that point, and with replication lag you are not sure whether it has or hasn't. Probably this is not the issue, but this is something to consider as well.
B
So, Kamil, is there a way that we can sort of say that if you remove a token, then within 500 milliseconds, or a second, or two seconds, that client will no longer get access, and not say it's an instantaneous thing? And then, coupled with that, is there a way that we can make database calls and, on the database call, almost have some sort of declarative way of saying: only allow this if the replication lag to this replica is less than that defined, well-known lag? Because then you can sort of safely say, you know, we'll use the replicas.

The other thing, from a database point of view, is that we have a lot of load on the database primary, and it's a limited resource; we're actually running out of resource on that quite quickly, and anything that we can do to move queries off it is really, really helpful. So there's another aspect to this as well.
C
Yes, so I fully agree. I think we can find a solution to make this work, because it kind of applies to everything. I know that we have tracking of the replication lag; I'm not sure if we have the ability to see that in, let's say, seconds, because this is the binary pointer offset in the log.
B
I don't know what the default there is. There is a default, like you say: there's a certain threshold, and when you get over that threshold the secondary will not be used. I don't know what that number actually is. If that's one second, then that would probably be sufficient to be able to say: well, if you delete a token, clients can still use it for one second, and that's just the way that GitLab operates.
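
To make the threshold idea concrete, here is a minimal Python sketch. All names (Replica, pick_connection, MAX_LAG_SECONDS) are hypothetical, and the one-second figure is the value speculated about above, not a confirmed GitLab default:

```python
import random

# Illustrative only: serve from a replica only if its measured
# replication lag is under a threshold, else fall back to the primary.

MAX_LAG_SECONDS = 1.0  # speculated threshold from the discussion above


class Replica:
    def __init__(self, name, lag_seconds):
        self.name = name
        self.lag_seconds = lag_seconds  # measured replication lag


def pick_connection(replicas, primary, max_lag=MAX_LAG_SECONDS):
    """Return a sufficiently fresh replica, or fall back to the primary."""
    fresh = [r for r in replicas if r.lag_seconds < max_lag]
    if fresh:
        return random.choice(fresh)
    # All replicas are too far behind: use the primary rather than risk
    # authenticating a token against stale data.
    return primary
```

Under this rule, deleting a token has a bounded staleness window: clients can keep using it for at most `max_lag` seconds, which is the trade-off described above.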
A
We could also retry the queries on the secondary servers if they fail. If a token is not found, for example, we retry on the secondary again, at least once, and then, as a last-resort measure, we could go to the primary before returning an error response to the client. That could be another option as well.
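
A rough sketch of that retry-then-fall-back flow, with a hypothetical `lookup` interface on the connections (not an actual GitLab API):

```python
class TokenNotFound(Exception):
    """Raised when a lookup finds no matching token on a connection."""


def authenticate_token(token_id, replica, primary, replica_attempts=2):
    """Try the replica first (with a retry), then the primary as a last resort.

    `replica` and `primary` are assumed to expose lookup(token_id),
    raising TokenNotFound on a miss; both are illustrative stand-ins.
    """
    for _ in range(replica_attempts):
        try:
            return replica.lookup(token_id)
        except TokenNotFound:
            continue  # the row may simply not have replicated yet
    # Consult the primary before returning an error response, so a freshly
    # created token is not rejected just because of replication lag.
    return primary.lookup(token_id)  # still raises TokenNotFound if absent
```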
C
I'm kind of thinking, João, if I am pronouncing your name correctly. Yeah, it's close, it's close. So what is the proper pronunciation?
C
Okay, so there is one very important aspect that you kind of noticed: we have methods that make us use the primary, and I'm kind of thinking that this went unnoticed for a very long time. Basically, even if we think about our logging, we have no way in our logs to know exactly how many requests use the primary versus the replicas.

So I think my very first suggestion, based on your findings about the methods, would be to actually introduce a way to log that into Kibana, to know what percentage of requests fall back to using the primary, maybe because of something new introduced in the code base that we did not anticipate.
B
Sean? No, I'm not, but Sean did the original work of adding the DB counts and DB durations, and the precedent for Redis is that we actually have, like, db...
A
So this is the load balancer code, and we can see here that only the methods listed here are served with a read-only connection; anything that is not here will fall back to a read-write connection.

So this is basically the reason why this route requires a primary and a read-write connection: all these methods are called during requests, and none of them are listed as safe for read-only connections. So this is another thing that needs to be investigated: whether or not those methods can be made read-only.
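
A minimal sketch of the dispatch rule being described: a fixed allowlist of method names is served read-only and everything else defaults to read-write. The method names and ConnectionProxy class below are illustrative stand-ins, not GitLab's actual load-balancing code:

```python
# Methods assumed safe to serve from a replica; anything unlisted is
# routed to the primary, which is why SELECT-only helpers that are
# missing from the list still force a read-write connection.
READ_ONLY_METHODS = {"select", "select_all", "quote", "quote_column_name"}


class ConnectionProxy:
    def __init__(self, read_only_conn, read_write_conn):
        self.read_only_conn = read_only_conn
        self.read_write_conn = read_write_conn

    def connection_for(self, method_name):
        if method_name in READ_ONLY_METHODS:
            return self.read_only_conn
        return self.read_write_conn  # conservative default: assume a write
```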
A
We have not investigated each one of them; this is one of the action items for this issue. There are some which are just SELECTs. I saw that, for example, migration_context is just a SELECT about the state of migrations, and quote, I believe, is just a SELECT as well; this one I saw the code for and I believe it's the same. I'm not sure about prepared statements.
B
I would suggest that you want to get the instrumentation in first, you know, splitting out the DB read counts and DB write counts into primary and secondary roles. And the reason is because then, when you make this change, presuming that's not a hard change to make, which I don't think it will be, you'll be able to see what impact it's had.
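
A small sketch of what that per-role instrumentation could look like; the field names (db_primary_count, db_replica_count) are hypothetical, not GitLab's actual log schema:

```python
from collections import Counter


class RequestDbStats:
    """Per-request query counters, split by database role."""

    def __init__(self):
        self.counts = Counter()

    def record(self, role):
        # role is "primary" or "replica", recorded once per query issued
        self.counts[role] += 1

    def log_fields(self):
        # Emitted with the request log so e.g. Kibana can show what
        # percentage of requests touched the primary at all.
        return {
            "db_primary_count": self.counts["primary"],
            "db_replica_count": self.counts["replica"],
        }
```

Having these counts in place before changing the routing gives a before/after baseline, which is the point being made above.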
D
I have one question for João and Hayley, because I know that there are a lot of things in flight that the package team is supposed to handle. We do have this PostgreSQL metadata database for the container registry.

There is this idea of building the on-premises caching proxy that we are still exploring, and now we are thinking about optimizing the JWT endpoint for the registry. Obviously all of these are very important problems to solve, but I'm kind of worried about the capacity of the package team, because we seem to take a step at many different problems at once, and I feel like with the current capacity it might actually be difficult to bring anything to completion. So what is your perspective on that, João?
A
It doesn't really have anything to do with the registry itself as an application, as a Go application; this is just about the Rails API, so this is stuff that the Rails engineers in Package can help with. It's not something that necessarily myself or Hayley would do. So that's for the authentication API. And then for the registry: well, the metadata database is already in progress and close to completion, and regarding the on-prem cache proxy, we still don't know if that's desirable for customers or not, and even if it is, that doesn't mean we will implement any changes or get it ready in the next couple of months. So right now it's mostly about finding solutions.
B
I will say that there's definitely an argument to be had, sorry, that this work should be done by the database team or the scalability team. Obviously it's a matter of finding when they can slot it in, but, you know, if you're struggling with the amount of things that are up in the air, it is a natural fit for one of those two teams. So it's probably worth engaging with them as well.
A
And in terms of using a separate fleet for a specific route, have we done that for any other route, or do we have a single fleet for every single one of them?
B
The
0.05
percent
on
are
not
because
of
you
know,
making
more
bulkheads
like
breaking
out
a
separate
head,
which
is
a
sort
of
bulkhead
generally,
doesn't
help
those
kind
of
things,
because
the
things
that
go
down
like
you
know,
are
the
database
or
maybe
a
bad
deployments,
or
you
know
things
like
that,
like
the
sig
faults
that
we
saw
last
week
from
from
garbage,
collects
compact
problems
and-
and
those
are
not
you
know
saved
by
by
the
the
segmentation
and
and
when
we
have
more
fleets,
there's
more
things
that
we
need
to
to
juggle.
A
Okay, and do we know which one of those has more availability on average? Is it the same?
B
I mean, they're both, you know, horizontally scalable fleets, and we have the same monitoring on both, so if we see that one of them is running out of workers, we'll increase it. Ideally they should both be healthy. You could go take a look historically at our availability on the SLA dashboard in Grafana and see which one's better; I would guess that it is probably the API.

I think, almost regardless of whether the API is better or not, we should move it there, because it's less surprising: everyone expects it to be there, and if there are availability problems on the API, we should address those separately. I don't think there are, but, you know, the least surprising thing is for it to be on the API.
C
I'm also thinking that on web we perform much heavier calculations, ones that are significantly longer, while these requests on average should be significantly faster to execute. So this is also much better from the latency perspective of JWT auth, if you have more evenly distributed durations instead of the very spiky behavior you see on web. So I fully agree with Andrew that the API fleet seems like basically a better fit, because of the type of compute being executed there.
A
Okay, so I guess, as follow-ups: open an issue to try to add information to the logs on whether a request is using the primary or a read-only replica; also, consider moving this route to the API fleet instead of having it on the web fleet; and then we should probably start talking with the database team, to see whether they have time and are interested in helping with this, and with Scalability as well, to see if we can indeed have these routes served only by replicas.
B
I've got to run, but it's been a great meeting. Thanks, everyone. Yeah.
D
I can hijack a few minutes, because on the previous call we agreed that we need to process logs from our marquee customer. I'm not going to use the exact name of the customer, but basically we need to do some work to process these logs and send them to BigQuery, and there is some engineering work that needs to be done there. To be honest, I'm kind of struggling to find time to do that, and I wonder if anyone actually has a little bit more capacity to work on that with me and collaborate together on this.
A
I think at least we can open an issue under the epic for the API availability, and if anyone has time to tackle that, they can look at it later.
D
Yeah, okay. Can we introduce the epic? I'll add an issue there, because there's no issue for this yet. Yeah.