From YouTube: GitLab Container Registry - High Availability discussion
Description
Discussion with Distinguished Engineers about making GitLab Container Registry highly available.
A: So today I hope that we'll have the opportunity to talk about the container registry and high availability. That's something that one of our customers actually requested. Unfortunately, we are waiting for Zhao. I would really like him to join, but there is an ongoing production incident related to the container registry right now, and he told me that he might be a little bit late. So there is an agenda document, and I will try to take some notes.
A: Yeah, so I just wanted to start by saying that this is a very interesting problem, and it's kind of a difficult problem to solve for the customer without knowing exactly where the problem is. I know that the team and other people, like John Hampton, perhaps actually had the opportunity to talk directly to the customer.
B: If I can summarize it again: a few weeks ago now, maybe a month and a half, we had an outage that affected the container registry. It ended up being about seven hours in total, and it just so happened that for one of our major VIP wireless customers, the iPhone 12 was released that day, and they couldn't pull anything. They couldn't pull an image into their Kubernetes pods, which prevented them from scaling up, and they ended up losing a lot of sales. So they were very upset. But in general they have this idea that, as gitlab.com customers, they want to rely on GitLab.
B: They want to rely on GitLab for the management of the platform, but they would like some additional redundancy when it comes to their container registry. It kicked off a bunch of projects. We've been considering things like: how do we make the services around the registry, like the auth API, more reliable? Another issue that they brought up was potentially having a push-pull mirror, locally or in multi-cloud.
B
So
not
you
know
as
an
option
for
for
them
to
provide
some
additional
done,
redundancy
and
so
where
that's
all
left
off
is
we've
kind
of
told
them.
We
have
some
immediate
plans
that
are
actionable
and
we've
been
making
changes
to
the
runner,
we've
been
making
improvements
to
the
dependency
proxy
and
that
we
will
present
to
them
these
blueprints
and
a
plan
for
this
future
state,
which
could
include
this
push-pull
mirror
or
could
include
higher
availability
in
other
areas
of
the
services.
B: So with all of that in mind, the engineers at GitLab have been putting together this blueprint. Joan and Gorish and Hayley have put a lot of work and thought into this, and so now this discussion is for all of you to come together and say: okay, what's actually feasible and which direction do we want to go?
A: Yeah, thank you very much for summarizing that. Coming into this, have you had the opportunity to actually read the blueprint?
C: I haven't had a huge opportunity, so I'll be honest, I'm sorry about that. It's the second day back.
D: For me, looking over the blueprint, I think the general idea is that we somehow introduce a component that can be intelligent and cache, or provide mirroring functionality for the container registry. It would, I presume, be run by the customer within their control, and there would be some kind of notification system, pull or push, to have some kind of consistency. I think that this is the high-level idea behind the blueprint. Am I correct, team and Hayley?
A: Yeah, I think that is correct. I see that Hayley joined, so.
E: Yeah, so I think the big thing with that is: it's not push-pull, it's pull only. Introducing pushes really opens up a can of worms, since tags are mutable, and then we'd have two sources of truth for tags if we enable pushes on this local proxy.
A: Yeah, so I had the opportunity to read the blueprint and, to be honest, I feel like I'm not completely sold on this idea, although it might be a very interesting proposal. The concerns that I have, and I might be totally wrong, which is the reason why Camille and Andrew and other people are here, are that it's quite a complex solution and, to be honest, I'm not completely sure it is actually going to solve the problem for the customer.
A: It's a complex solution because, first of all, we don't really know what the infrastructure of the customer looks like. We know that it's Kubernetes, but you can model many different things with Kubernetes, and it's always a little bit of a challenge to introduce a highly available service in a distributed environment. We do not really know if the customer is going to do that successfully, and there is always this question of how to actually model a highly available, presumably eventually consistent, data store in Kubernetes.
A
From
what
I
understood
from
the
blueprint,
we
do
not
really
want
to
use
object,
storage
because
object.
Storage
is
not
reliable
enough
right,
so
the
idea
is
to
actually
build
some
kind
of
a
proxy
that
has
this
peer-to-peer
data
exchange
mechanism
built
in
and
I
feel
like.
We
are
stepping
on
the
this.
You
know
difficult
territory
of
tackling
consensus,
algorithm
and
stuff.
C: The first concern I have, which I've already raised with you, is that as soon as you start saying five nines, that is not something that modern infrastructure is built around. So, you know, I pointed you to our ebook. If you look at Google Cloud, Amazon, GCP, nobody would say that they're going to give you five nines of availability. It's kind of like interest in a bank, or rather an investment.
C: If you had a 99.999% chance of getting your money back, then the chances of you making any money from that are very low, and it's the same here: the risk that you can take if you're trying to keep an availability of five nines is almost nothing, and that holds back a lot of stuff on gitlab.com.
C: So there's a whole lot that I can talk about there. But I think, as a first point, we really want to say that five nines is not a realistic target that we want to go with. At the same time, maybe what we can do is build something that's very decoupled from the main application, and through that, and redundancy on top of that, we can kind of aim towards it, rather than building it into the application.
C: You could have three copies of this, and if one of them went down, say the AWS one could still stay up, or the Azure one could still stay up. I think it would be much simpler to get the availability through redundancy rather than through, you know, distributed protocols or anything like that. That is a whole rabbit hole which I don't think we want to go down and, to quote the, you know, values:
C: It's not a boring solution, whereas redundancy is, and there's a lot of highly available software that's just built on top of redundancy. So I think we should try and focus on that, at least without having a lot of knowledge of the solution.

A: Yeah, I agree with you, Andrew.
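To illustrate the availability-through-redundancy argument being made here, a minimal sketch. The key (and optimistic) assumption is that the replicas fail independently; the numbers and replica counts are illustrative only.

```python
# Availability through redundancy: if each of N independent cache replicas
# is up 99.9% of the time, all N are down simultaneously only
# (1 - 0.999)**N of the time.

def combined_availability(per_replica: float, replicas: int) -> float:
    """Probability that at least one of `replicas` independent copies is up."""
    return 1 - (1 - per_replica) ** replicas

single = 0.999  # "three nines" for one cache node
print(combined_availability(single, 1))  # one node: still three nines
print(combined_availability(single, 3))  # three nodes: roughly nine nines on paper
```

The "on paper" caveat matters: correlated failures (a shared cloud region, a shared bug) break the independence assumption, which is why the discussion keeps coming back to spreading replicas across providers.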
E: Yeah, so from what we've observed: we also have issues open for the gitlab.com registry service, kind of pursuant to this goal of more uptime and higher availability. There's the issue that the registry depends on the GitLab API to authenticate, and then, besides that, there's GCP.
E
So
both
of
those
have
9.95
or
so
I'm
not.
I
don't
have
the
exact
numbers
in
my
head
right
now,
but
so
I
think
that's
on
that
front
of
just
making
the
servers
more
available.
We
have
that
kind
of
a
two-prong
thing.
It's
about
to
be
pronged
once
the
metadata
database
is
up
as
well
for
the
registry
and
what
we've
observed
is
it's
the
weakest
link
in
recent
recent
outages
has
been
the
api
for
authentication.
That's
been
lower,
has
a
lower
availability
than
the
storage
bucket.
C
You
can
I,
I
have
no
information
on
me
right
now,
but
my
when
I
think
back
to
the
incidents
that
I've
been
involved
in
a
large
number
of
them
have
been
down
to
gcs,
storage
and-
and
you
know,
availability
problems
on
that.
Certainly
like
the
last
three
or
four
registry
issues
that
I've
built.
The
incidents
that
I've
been
involved
in
have
all
been
that.
So
do
you,
do
you
have
numbers
on
that
or
like
what
did
you
use
to
base
that
on.
D: I'm kind of wondering, because I wrote a comment to that effect as well: we consider the API to be something significantly wider than what the registry requires. So I was proposing at some point that, if the auth is a problem, is there some way to disconnect this JWT auth endpoint from the rest, to be a separately, additionally managed component?
D
If,
if
this
is
something
that
like,
we
have
so
much
emphasis
on
sla,
because
right
now
like
it's
in
this,
like
the
amount
of
the
request,
this
endpoint
receives
it's
like
pretty
high
magnitude,
but
it
kind
of
falls
into
the
common
bucket
and
basically,
the
noisy
neighbors
can
really
heart
this
availability
of
the
registry
and
maybe
like
one
of
the.
D
If
this
is
really
like
the
problem
that
we
are
facing
with
the
appi,
maybe
one
of
the
the
first
step
is
like
to
disconnect
this
endpoint
from,
like
general
epi
feed
or
like
the
web,
switch
to
the
separate
fleet
the
same
way,
how
we
have
like
the
git
handling
separately
being
done
to
web
and
happy
if
this
is
so
important
component
compared
to
everything
else.
A
So
I
think,
actually,
it's
possible
to
actually
improve
the
apa
endpoint
for
the
container
registry
authentication,
but
I
would
like
to
still
challenge
this
idea
that
we
need
to
dramatically
improve
the
availability
of
container
registry,
like
that.
What
what
we
do,
in
my
opinion,
depends
on
what
the
problem
of
the
customer
is
exactly
like.
I'm
still
not
able
to
you
know,
have
an
answer
for
such
a
simple
question
like.
Is
that
all
the
images
that
they
have
that
they
need
to
have
highly
available?
A
Or
is
there
one
or
two
images
for
just
the
services
that
are
being
updated
most
frequently
that
they
need
to,
like
you
know,
have
in
a
hot
cache
all
the
time
like.
B: Well, one thing is that they push many, many images; they turn over pretty quickly. But one of the things they were talking about was that even if they'd had some of the images cached, that would have helped them, because they could have used an older image that was in the cache, and that would have been okay.
B
And
when
we
were
talking
about
five
nines.
I
think
the
contact
at
the
customer
was
saying
they
don't
really
want
five
nines,
like
they
understand
that
that's
not
may
not
be
in
our
zone,
but
they
were
really
worried
about
the
time
to
resolution
like
if,
if
we
had
99
availability,
but
we
never
had
an
outage
of
more
than
a
half
hour
or
an
hour
and
and
things
were
held
in
the
cache
that
would
probably
be
okay
with
them.
B
So
yeah,
I
don't
think
that
they're
saying
you
have
to
hit
five
nines,
or
else
we're
going
to
go
elsewhere.
I
think
they're
just
saying
we
need
to
have
options.
We've
got
redundancy.
B: Pulling and pushing definitely affects their builds, and if their builds are broken, that's frustrating, but they could live without that for some time. The problem was, they were trying to scale up the app and pull images, and they couldn't do it. That was really the core of their concern: if something happens and they need to scale up their site, they need to be able to pull reliably.
E: I asked a similar question, and I think one of their concerns was the number of images that they use and the size of those images. Just the storage capacity to have a warm cache on all their kubelets would be too much.
C: Yeah, okay, so that's why the cache hit rate would be a problem, because it's per node. But is there not some Kubernetes cache, or at least some sort of container caching solution, on the cluster, that could be shared across nodes? Because, you know, I can imagine that even if you have a lot of images, you should be able to set aside, you know, a few terabytes for a local cache.
C: Yeah, and obviously we'd have to use a different machine or a different cluster, because it wouldn't be good if we took everything down and then couldn't spin up new machines. But I do think it'd be good if we had a product solution, even if it was some third-party piece of software, that we could encourage customers to run as a local cache, because it obviously takes load off gitlab.com, and, you know, the bigger their cluster gets.
D: I'm kind of thinking, because we have a container proxy in the GitLab dependency proxy feature: could they basically run their own GitLab, configured against a different object storage provider or whatever, and request images from their GitLab, with that GitLab in turn requesting images from our GitLab?
A: Interesting, because I still have questions about the storage. From what I understood, they want to have a highly available solution. So if we build this on-premises reverse caching proxy for the container registry, they would need to run it in kind of a highly available manner.
A: And how do we approach storage in that particular case? Because if they also want to run it inside Kubernetes, it might be a little bit tricky to run it in a highly available way. Outside of Kubernetes might be tricky as well. It's always tricky when we start thinking about where they are going to store their cache.
D: I'm kind of thinking that this is the unsolvable problem with that approach: if you have a highly available caching service that you base your availability on, it's just waiting for a disaster to happen at some point. If the service, for whatever reason, fails, and gitlab.com fails at the same time, you just don't have any data; it doesn't prevent the outage. Caching is good for reducing load, but if things go wrong, they kind of go wrong in a cascading way.
C
I
mean
so
just
to
challenge
that
a
little
bit
like
if
you
had
the
most
boring
solution
right.
You
had
three
caching
nodes
that
were
each
in
a
different
region
and
you
actually
just
went
with
like
block
storage
like
really
really
boring,
old-fashioned
block
storage
behind
as
the
as
the
cache
thing,
which
is
pretty
reliable.
Like
I
can't
remember
the
last
time
we
had
a
big
block,
storage
incidence
and
you
have
three
caches
and
you
know
hits
are
randomly
pushed
to
different
caches.
C
If
one
of
them
falls
over,
you
know
that's
what
kubernetes
is
good
at
they'll
direct,
the
request
to
the
other
two
caches.
We
don't
need
to
build
any
complicated
mechanisms
for
kind
of
retrieving
between
the
caches
or
anything
like
that,
and
if
one
of
them
one
of
the
nodes,
goes
down,
it
just
gets
taken
offline
and
if
it
just
so
happens
that
you
know
that
there's
an
image
request
that
that
isn't
on
one
of
those
nodes,
then
so
be
it
then
you
know,
that's
that's
and
it
and
obviously.
D: I know, but there is one problem with that: you assume that your cache is significantly smaller than the number of images that you need to keep hot. It means that at some point you're going to have cache eviction events, and, for example, you'll have your application running on Kubernetes from an older image, that image has already been evicted from all the cache nodes, and your upstream provider is gone. You cannot pull it; you just don't have it.
C: Perhaps having some more information on those patterns would be helpful here, because my gut feeling, and this could be totally wrong because I have no background on this, is that the things that you need to scale up the most are probably also the hottest things in your cache, and the things that are most critical to scaling.
C
You
know,
like
maybe
your
you
know,
customer
facing
websites-
that's
got
some
new
specials
on
it
and
those
things
are
the
most
likely
things
to
be
in
the
cache
rather
than,
as
you
say,
the
you
know
the
three
week
old
image
that
you
know
it
hasn't
seen
as
much
activity.
I
might
be
wrong
on
that,
but
that's
just
my
gut
feel.
D
That's
that's
my
concern
really
but
like
if
it
faces
it
spice
badly
and
like
it's,
this
mechanism
doesn't
will
not
help
you,
I'm
kind
of
like
thinking
that
like
if
you
want
to
have
like
the
actually
highly
available
service,
you
need
to
have
actual
replica
of
this
data
in
multiple
resources,
and
you
just
try
these
sources
to
fetch
them
and,
like
you,
just
retain
these
data
as
long
as
you
need
them
basically,
and
this
kind
of
gives
you
like
the
guarantee
that,
like
if
app
is
fluke,
you
just
have
another
source
that
it's
not
dependent
on
that
app
and
has
another
type
of
the
storage
it
could
be.
D
I
don't
know
azure
with
their
object
storage.
That
is
basically
different
technology,
but
actually
have
that
highly
available
because,
like
that,
it's
kind
of
to
some
extent
as
soon
as
different
services,
availability.
D: The auth is an interesting problem, but in the current registry the auth service is external to the registry, so it could even be, I don't know, HTTP basic auth in the simplest case, really, to provide a JWT token. So it could be the most boring solution, really, to have additional storage for these images.
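For context on the external-auth point: in the Docker distribution token scheme, the registry answers an unauthenticated pull with a 401 whose `WWW-Authenticate: Bearer` header names a token realm, and the client then fetches a JWT from that realm with its own credentials. A simplified sketch of the client side (the realm URL and service name below are placeholders, and the header parsing ignores escaped quotes and commas inside values):

```python
from urllib.parse import urlencode

def parse_bearer_challenge(header: str) -> dict:
    """Parse `Bearer realm="...",service="...",scope="..."` into its parts
    (simplified: assumes no commas or escaped quotes inside values)."""
    assert header.startswith("Bearer ")
    parts = header[len("Bearer "):].split(",")
    return {k: v.strip('"') for k, v in (p.strip().split("=", 1) for p in parts)}

def token_request_url(challenge: dict) -> str:
    """Build the GET URL the client calls (with its own credentials, e.g.
    HTTP basic auth) to obtain a JWT the registry will accept."""
    params = {k: challenge[k] for k in ("service", "scope") if k in challenge}
    return challenge["realm"] + "?" + urlencode(params)

hdr = ('Bearer realm="https://gitlab.example/jwt/auth",'
       'service="container_registry",scope="repository:group/app:pull"')
url = token_request_url(parse_bearer_challenge(hdr))
```

Because the registry only checks the signed token, the service issuing it can in principle be anything, which is the "boring solution" being suggested.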
D
I'm
kind
of
like
thinking
that,
like
we
have
this
proxy
in
the
container
registry
of
the
github,
it
kind
of
source
some
of
these
problems,
but
it's
still
like
it's
still.
Caching,
I
just
find
like
the
efficiencies
with
the
caching
and
assuming
that
the
classroom
gonna
prevent
this
kind
of
like
problem
that
customer
had
in
this
particular
case.
A
But
perhaps
we
are
not
actually
solving
a
high
availability
problem
here,
because
it's
not
what
customer
needs
it.
It
might
be
something
that
they
articulate
they
need,
but
it
might
be
something
that
they
don't
need
actually,
and
perhaps
caching
could
be
enough,
but
I
I
feel
like
if
caching
would
be
enough
and
in
what
form
it
depends
on
tiny
technical
details
of
their
problem,
and
I
I've
not
seen
a
complete
description
of
the
problem
in
a
document
or
like.
A
I
know
that
we
talk
to
them
a
lot
and
that's
you
know
the
reason
why
we
do
have
this
blueprint
written.
But
it's
not
entirely
clear
to
me
what
the
problem
is
exactly
because
it
depends
on
small,
technically
the
technical
details
that
are
not
here
and
yeah.
I
just.
C
Just
on
that
point,
one
of
the
things
that
would
maybe
help
here
is:
we
could
look
at
the
registry
logs
and
get
an
idea
of
those
traffic
patterns
for
this
particular
customer.
You
know
for
for
a
week,
or
we
could
look
in
in
our
in
our
indeed,
you
know
in
other
storage
locations
and
get
more
data
than
that
or
we
could
see
sort
of
what
the
working
set
is
and
then
you
know
we
could
probably
figure
out
like
how
big
like
a
cache
would
need
to
be
and
and
then
get
a
much
better
idea.
A: Thank you for that suggestion. In my opinion, it might not give us a full picture of events like the one with scaling and the release of a new product, but it would actually give us some insights about how they modeled their infrastructure behind it.
C: Yeah, we keep the logs in Google Cloud Storage, ironically, so we could pull that and work from it. We've also got, you know, the number of bytes that we sent, so from that we can look at what the working set size would need to be for a cache to have been effective, and do some better modeling around that.
A: Yeah, that's interesting. I would still like to highlight one proposal from the blueprint that I find interesting. I'm sorry for the background noise; the kids are back. So, the blueprint describes a notification mechanism designed to notify the on-premises thing, whatever it is, either a cache or something else. It's simply a webhook being sent from GitLab to this external service that will allow it to preheat the cache, or warm it up, whatever we call it.
A
We
call
it,
and
I
think
that
it
can
actually
help
not
only
this
customer
but
other
customers
to
design
their
own
solution
if
they
can
simply
go
to
github
and
configure
a
web
web
hook.
That
will
notify
their
service
that
there's
a
new
image
pushed
or
a
new
version
of
the
same
tag.
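To sketch what a receiver of such a webhook might do, a minimal example of turning an "image pushed" notification into the pull that pre-warms a local mirror. The payload shape, field names, and registry host are all hypothetical, since the blueprint's webhook format isn't specified here:

```python
import json

def warm_command(payload: bytes) -> list[str]:
    """Turn a (hypothetical) 'image pushed' webhook payload into the pull
    command a local mirror would run to pre-heat its cache."""
    event = json.loads(payload)
    image = f"{event['registry']}/{event['repository']}:{event['tag']}"
    return ["docker", "pull", image]

# Example payload as the webhook sender might POST it (hypothetical shape):
body = json.dumps({
    "event": "tag_push",
    "registry": "registry.gitlab.example",
    "repository": "group/app",
    "tag": "v2",
}).encode()
cmd = warm_command(body)  # ["docker", "pull", "registry.gitlab.example/group/app:v2"]
```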
A: So where we are is: we can actually get the data from logs and understand better what the customer problem is, and we basically, I think, agree that building a webhook notification mechanism can help not only in this particular case, but in other cases as well.
C: Just one thing that I have seen in the past: one of the highest endpoints in terms of traffic on gitlab.com at the moment is that JWT auth endpoint, and I think we've mentioned it before, but there seems to be a one-to-one ratio between fetching a container and the auth request. If we could reduce that, or gear it down a little bit, that might make a huge difference.
C: Consider moving that elsewhere. I really have to go, but I've just got one other very subtle point, as to your point, Ali: weirdly, the JWT auth endpoint does not get routed to the GitLab API fleet. For purely technical-debt reasons, it actually goes to the web service, which is very confusing and not what you'd expect, and something that people have been meaning to fix for a very long time. It's just about how HAProxy routes things, but that's just something to note; it's very confusing.
A: Yeah, so it's a bit unfortunate that Zhao couldn't join today, but I really like this idea of checking in our logs how the customer is using the container registry. It can actually give us an answer to the question of how many images they use and how big they are in total. And then I think that the on-premises caching proxy might actually be a good solution, but we need to simplify it somehow.
E: And we'd be maintaining that in addition to the main registry endpoint.
D
I
I
have
also
one
more
suggestion,
which
is
after
exaggeration,
the
jwt
out
is
basically
so
frequent,
so
we
could
actually
calculate
pretty
much
real
sla
of
this
endpoint
in
particular,
like
taking
into
account
like
the
duration
and
like
how
many
errors
were
produced
in
the
given
period
of
time.
So
like
because,
like
as
andrew
said
like
this
is
jwt
out
which
is
going
to
the
web.
D
Freeze,
not
the
api
fleet,
but
since
this
is
also
like
very
noisy
neighbors
like
it
could
allow
us
to
estimate
if
we
would
like
run
that
endpoint
separately
to
everything
else.
What
sla
we
could
offer
like
what
was
like
the
sla
on
this
endpoint
so
far,
and
when
it
failed.
A: So let me check if we're on the same page. You suggest going to our logs to understand better what the availability number for the authentication endpoint is, because we can calculate availability based on the number of requests, successful and failed, right? This way we can actually get a concrete number for the last week, for example, or we can use the historical data to get it for the last, I don't know, month or year. Is that right?
D
I
mean
like
yes,
I
mean
like
if
you
would
somehow
be
able
like
to
find
intervals
from
let's
say
last
year,
based
on
the
logs
like
when
this
endpoint
fight.
This
could
give
us
like
very
good
in
like
hint
about
the
sli
of
this
endpoint.
Historically,
I
mean
this
particular
endpoint.
It's
still
gonna
be
affected
by
the
noisy
nightboards,
but
I'm
just
curious
about
like
this
particular
endpoint.
E
That
in
point
you're
getting
it
really,
the
authentication
ultimately
relies
on
the
database,
the
rails
database,
and
I
think
that
has
a
lower
sla
than
this
customer
desires
itself.
It's
necessarily
you
know
that
endpoint,
you
know,
regardless
of
whether
you
can
hit
it
or
bring
it
like.
I
think
it's
ultimately
relying
on
a
service
that
the
customer
has
indicated
is
not
available
enough
for
their
case.
B: The early work that we did after the incident happened helped, because we added multi-zonal clusters and we upgraded our support contract with Google. We were using a third-party vendor before, so now we have a direct support line with Google. So those things helped a little bit, I guess, hopefully, but it's unknown.
B: Yeah, exactly. They are currently using Artifactory as a pull-through local on-premise cache. They're not happy about it, because they're maintaining it, and, like Kelly mentioned, they want GitLab to be administered totally by GitLab. They have this whole core-versus-context approach that they want: no one will be better at managing GitLab than GitLab, is their architecture position.
B: Yeah, so that was one idea that was brought up, and another idea was a multi-cloud pull-through cache. They want to solve this problem, but they're looking to us, and as far as we know no one is doing that; whatever's happening is not sufficient. So they're evaluating.
A
On-Premise
discussion
proxies.
So
if
we
build
this
on-premise
discussion
proxy,
that's
definitely
going
to
be
a
huge
effort
and
give
it
to
them
to
maintain.
Are
they
going
to
be
happy
with
that
solution
or
not.
E: Yeah, I mean, they've talked about their ideal, which is, you know, an endpoint that is just always available, and yes, that's ideal. I think they've mentioned this as sort of a stop-gap until that service is up to an availability that they're comfortable with.
E
I
think
part
of
this
proposal
is
to
to
show
the
customer
and
say
this
is
a
possible
solution.
How
do
you
feel?
We've
worked,
some
of
the
technical
words
out
and
you
know
showing
some
light
on
what
this
would
really
be
like
I,
I've
been
wondering
as
sort
of
breaking
away
from
this.
A
little
bit
is
if
it
would
be
possible
for
us
to
work
with
them.
E
You
know
have
infrastructure
team
come
and
work
with
them
on
something
that
is
like
sort
of
a
best
practice
for,
like
you
know,
having
using
multiple
registries
like
not
only
gitlab.com,
but
also
something
like
docker
hub
or
key,
or
something
like
that
and
having
having
a
an
infrastructure
that
can
sort
of
adapt
to
one
of
those
endpoints
being
down,
because
I
think
that
takes
some
of
the
the
pressure
off
of
us
to
be
this
endpoint
that
can
that
never
goes
down
that
has
a
availability,
that's
higher
than
what
you
see
in
most
cloud
services.
D: I'm kind of thinking that there are really two paths. One path is that we should figure out a way to improve our service, and some of the aspects your team mentioned, they did address. But we still have, for example, this API concern that Hayley is mentioning, and even there we can improve: we checked recently that it's using the database replica, so it can be largely exempt from all the other storms around the database.
A: Particularly about the API authentication endpoint: we can build a separate service with a separate data store, and push the latest set of privileges and credentials to that separate data store and separate service. It might be, you know, more like a compute service, but it's definitely possible to build something like that. The question is: do we really need to do that, without knowing what the availability number for this endpoint is right now?
D
I
I'm
not
talking
rebuilding
the
whole
authentication
scheme.
I
I
think
that
this
is
really
like
the
last
thing
that
we
should
be
doing.
It's
like
it's.
It's
super
complex,
given
the
multitude
of
these
schemes,
but
like
like
even
like
really
like
you
mentioned
that,
like
we
had
the
troubles
with
the
database
they're
like
we
have
a
ways
to
overcome
that
for
that
endpoint.
If
you
would
really
want
to
because
it
doesn't
require
the
main
database,
it
can.
D
It
seems
to
pretty
much
work
on
the
replica
today
so
like
it
may
be
less
respectable
to
storm.
But
then
I
I
heard
that
we
also
talked
about
the
jio
for
the
github.com,
and
this
is
maybe
like
the
long
term
aspect
like
on
how
to
solve
that
problem,
because,
like
maybe,
we
would,
at
some
point,
have
like
the
sibling
gitlab
running
with
all
the
same
data
replicated
across
different
zones
with
the
database
and
everything,
and
maybe
like
the
jio,
would
be
like
the
ultimate
solution.
D: But I also like Hayley's suggestion about: can we just push this image to multiple places? Can we configure GitLab CI to be able to specify multiple places for pushing the image? Can we configure Kubernetes to try multiple places to pull that image? This seems like something where, if they bought a subscription on Docker Hub or elsewhere, they could basically have a cloud-managed service where they would have all this data always replicated across different clouds.
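The "try multiple places to pull" idea can be sketched as client-side fallback across mirrors. A real implementation would live in the container runtime's mirror configuration or a registry client; here the fetchers are injected stand-ins, and all names are illustrative:

```python
def pull_with_fallback(image: str, registries: list) -> str:
    """Try each registry in order and return the first successful result:
    a sketch of client-side redundancy across replicated registries."""
    errors = []
    for fetch in registries:
        try:
            return fetch(image)
        except Exception as exc:  # a real client would catch narrower errors
            errors.append(exc)
    raise RuntimeError(f"all {len(registries)} registries failed: {errors}")

def primary(image):
    # Stand-in for the primary registry being down.
    raise ConnectionError("primary registry unreachable")

def mirror(image):
    # Stand-in for a Docker Hub / Quay mirror holding the same image.
    return f"manifest-for-{image}@mirror"

result = pull_with_fallback("group/app:v2", [primary, mirror])
```

This only helps, as noted earlier in the discussion, if the image was actually pushed to every mirror, which is why it pairs with the "push to multiple places" half of the suggestion.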
B: Yeah, I agree with that. One change we made to the runner that was really helpful is we gave them a variable that just says "dependency proxy URL", and they're able to fill it in in their group settings. They were concerned they were going to have to go in and update all of their many thousands of developers' pipelines to point to a specific dependency proxy, but now they're able to use a group setting for that.
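For context, this is the pattern GitLab's dependency proxy variables enable in CI configuration; the job below is a sketch (the variable name is from GitLab's predefined CI variables, the image is an example):

```yaml
# .gitlab-ci.yml sketch: pull the base image through the group's dependency
# proxy instead of hitting Docker Hub directly, with no per-pipeline URL
# hard-coded into thousands of individual pipelines.
build:
  image: ${CI_DEPENDENCY_PROXY_GROUP_IMAGE_PREFIX}/alpine:latest
  script:
    - echo "built via the dependency proxy"
```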
B
So
I
think
these
kinds
of
small
changes
that
improve
not
only
what
their
experience,
but
all
of
our
customers
experiences
are
what
we
should.
They
seem
to
me.
They
seem
more
desirable
and
more
reasonable.
I
think
if
we
go
and
we
if,
if
this
solution
for
the
on
prem
proxy
and
cash
seems
complicated
and
risky
to
to.
D: Being able to pull from many places and run from many places is also something super desirable for us, because we are affected by the same problem. Our CI is going to be down if we cannot pull the image from the dev GitLab today, right? And it's a big thing for our team as well. So we are also affected today by these problems that they are facing, and I think that we really have the same approach to those problems as they have.
D
We
want
to
maintain
minimal
amount
of
the
components
to
run
the
service,
but
we
also
run
the
cloud
native
right
now
on
the
on
the
github.com,
so
we
also
have
to
pull
many
versions
of
the
github
from
from
the
container
registry
of
github
or
from
the
docker
hub.
So
we
also
have
the
same
problems
that
they
have
really
and
I
I
think
like
we
would
be
really
like
the
first
customer
to
solve
these
problems,
for
because
it's
like,
if
we
cannot
scale
up
our
application,
they,
then
our
customers
gonna
be
affected
as
well.
D
If
we
cannot
like
release
the
fix,
our
customers
gonna
be
affected
as
well.
So
I
I
think,
like
this
is
the
important
aspect
of
that
is
like
they
have
the
problem
that
we
really
have
as
well.
We
just
didn't
yet
encounter
this
to
be
a
problem
for
us.
A: Yet. So that's interesting, and I think that we should take a step back and try to understand the customer's problem again. As Andrew suggested, we can pull data from our Elasticsearch cluster to understand better how the customer is using the container registry, and then we can calculate the API availability number from the logs as well.
A
Then
I
think
it
it
would
be
great
to
actually
triangulate
the
solution
looking
at
what
we
might
need
as
a
company
and
the
first
customer
of
content
registry
and
what
other
users
and
why
their
community
might
benefit
from.
I
wonder
team,
if
actually
it's
it
would
be
possible
to
get
like
a
document
in
google
docs
or
something
that
I
will
describe
the
the
problem
that
the
customer
is
facing
a
little
bit
in
in
more
detail.
B
A
Yeah, so I think it would be extremely helpful to actually have it in a single place, so that people interested in helping with finding a solution could read it and understand the problem better. I'm always thinking that understanding the problem is a prerequisite for finding a solution, and if we could triangulate the solution with what we need and what the wider community needs, it would probably be the perfect solution.
A
Well, there are no perfect solutions, but there might actually be good-enough solutions. So I wonder if we can have all three action points done before the next meeting, because I feel like that's a very interesting problem, and having, I don't know, bi-weekly meetings to discuss this would actually help a lot as well. And then next time Zhao will perhaps be able to join as well.
B
So for me, the action item is just to have the problems that this customer encountered in one place, and I think we should have it in the main epic that we're using for the architecture. I'll just make sure it's there and I'll share it with the people on this call afterwards. That would.
A
Be perfect. Now we'll try to get data from logs so that we can better understand how they are using the container registry, and also calculate the availability number for the container registry authentication API endpoint.
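The availability number mentioned here can be computed from request logs as a simple ratio. A minimal sketch, assuming structured log entries with `path` and `status` fields (the field names and the `/jwt/auth` path are assumptions, not GitLab's actual log schema):

```python
# Sketch: compute availability of an endpoint from request log entries
# as the share of requests that did not fail with a server error (5xx).

def availability(entries, path="/jwt/auth"):
    """Return the fraction of requests to `path` that returned a
    non-5xx status, or None if there were no matching requests."""
    relevant = [e for e in entries if e["path"] == path]
    if not relevant:
        return None
    ok = sum(1 for e in relevant if e["status"] < 500)
    return ok / len(relevant)

logs = [
    {"path": "/jwt/auth", "status": 200},
    {"path": "/jwt/auth", "status": 200},
    {"path": "/jwt/auth", "status": 503},
    {"path": "/v2/", "status": 200},
]
print(availability(logs))  # 2 of 3 auth requests succeeded
```

Client errors (4xx) are counted as available in this sketch, since they mean the service answered; only 5xx responses count against availability.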
B
And
and
the
next
output
from
there
could
be
now
we
we
have
this
problem.
We
can
are
looking
at
the
data.
A
next
step
from
there
could
be
a
follow-up
meeting
with
the
customer
to
better
understand
what
they've
done
or
maybe
make
some
recommendations
on
how
to
pull
from
multiple
registries,
as
well
as
we'll
be
able
to
now
talk
through
some
of
the
ideas
that
were
brought
up
pretty
shallowly.
In
our
initial
conversation
like
oh,
we
want
a
a
cap,
an
on-prem
cache.
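The "pull from multiple registries" recommendation boils down to trying a list of endpoints in order and taking the first that answers. A hypothetical sketch (the function names and registry hostnames are illustrative, not a real client API):

```python
# Sketch: client-side redundancy across container registries.
# Try each registry in order and return the first successful result.

def pull_with_fallback(registries, image, fetch):
    """Return (registry, manifest) from the first registry that
    responds; raise RuntimeError if every registry fails."""
    errors = []
    for registry in registries:
        try:
            return registry, fetch(registry, image)
        except Exception as exc:  # e.g. connection errors, timeouts
            errors.append((registry, exc))
    raise RuntimeError(f"all registries failed for {image}: {errors}")

# Usage sketch: the primary registry is down, the mirror answers.
def fake_fetch(registry, image):
    if registry == "registry.example.com":
        raise ConnectionError("registry unavailable")
    return {"image": image, "from": registry}

source, manifest = pull_with_fallback(
    ["registry.example.com", "mirror.example.com"],
    "myapp:1.0",
    fake_fetch,
)
print(source)  # the mirror, since the primary raised
```

In practice, a pull-through cache or a container runtime configured with registry mirrors does this fallback for you, rather than application code.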
B
A
Our discussions on the merge request can actually be something that is interesting to them as well, so yeah, our collaboration on the blueprint might actually be good for them to read and understand that we actually explored all these ideas thoroughly. I totally agree.
B
Okay
and
and
then
we'll
have
some
recommendations
that
are
in
line
with
our
goals,
but
you
know,
like
you
all
mentioned,
we
encountered
these
problems
that
well
so
increasing
the
reliability
and
redundancy
is
a
good
thing,
but
we
should
do
things
that
make
sense
that
are
the
boring
solutions
and
that
are
iterative
and
not
take
on
big
scary
projects.
To
start.
B
That makes sense. Yeah, I'm on board with that too. I like the idea of improving the auth API, whatever we could do around that, and anything we could do around caching. I know we said caching won't fix everything, but if there's something we could do there, and some infrastructure recommendations, that would be great.
D
I
think
that,
like
the
right
way
for
this
would
be
to
step
back
and
describe
the
problem
in
as
many
liters
as
possible
and
and
then
like
figure
out
exactly
which
parts
of
the
problem
we
want
to
solve
because,
like
I
think
we
we
mentioned
a
few
potential
improvements
to
the
whole,
but,
like
whatever
is
more
important
right
now.
It
really
depends
on
like
on
the
on
the
description
of
the
problem
and
now,
like
I
think,
for
the
user.
D
Looking at the blueprint, it seems that we are going to work on that particular solution to describe some subset of the problems. But I think the general idea behind the blueprints is a high-level definition of the problem that we are solving, maybe with some hints at different approaches to how it can be solved; the blueprint is not there to describe the actual solution. I think the epics and issues are, but not the blueprints.
A
Yeah,
so
what
I,
the
my
strategy
behind
building
blueprints,
was
always
to
describe
the
problem
and
the
vision
that
will
help
us
like
to
solve
that
problem.
Like
architecture
is
always
a
hypothesis
and
you
iterate
on
architecture
to
in
order
to
actually
prove
your
hypothesis
and
then,
if
something
doesn't
work,
you
adjust
your
trajectory
so
yeah,
that's
just
a
random
thought.
Okay,
so
we
do
have
a
production
points.
A
I
will
create
issues
and
then
I
will
schedule
the
next
call
in
presumably
something
like
a
week
or
two
weeks
and
yeah.
I
just
wanted
to
thank
everyone
for
waking
up
early
and
joining
us
thanks
thanks
camille
for
for
joining
as
well
and
again
it
was
great
that
andrew
joined.
A
So
thank
you
very
much
and
see
you
next
time
have
a
great
day,
bye
and.