Description
A dose of deep GitLab foo & healthy snack options from Lee Matos, Support Engineering Manager
deck: https://docs.google.com/presentation/d/1kqAvqqSPYmD0OdvYsAdLPEJFJ42J0FVsQOIeg2Jcpaw/edit?usp=sharing
A
...and you could get other sizing, so you could get whatever you want at whatever cadence. It's beautiful. I'm doing this thing in my life right now, and then we're gonna get started. It's a 101 where I'm doing what I'm calling my random weeks and then my structured weeks; I need a lot of structure in my life. Oh, this is all recorded, shout out to the recording. Yeah, you're getting deep insights. Anyway.
A
I signed up for... I'm in a meat club, so I get like a meat CSA, but it's random; I never know what I'm gonna get. And then the next thing I signed up for was this thing called Misfits Market, which is just random grocery store items that they wouldn't sell because they are not beautiful enough to be sold in a supermarket. So now I have like a week where I get random meat and random vegetables, and then the other weeks it's very structured: I know what I'm getting at the store. Well...
C
The future is already here, it's just not widely distributed. But ya know, it's great to hear about all these channels for getting amazing food at whatever cadence or level of freshness. The fact that you have that option is incredible. And yes, we are recording, but this is how we do it. Hey, Lee, I just really want to say: thank you so much for doing this.
C
You know, an earlier iteration of this deck came across my virtual desk, and for us in professional services, training future GitLab admins, this was very relevant. Part of our goal is that we're growing with our customers, and we want to iterate our training content to match where they're at and where they're going. So this definitely felt in that vein. I will hand it over to you, because I know there's an intro in the deck itself.
A
What I'm gonna say is: if you want to throw the notes link in the chat as well, I'll pull it up on my other screen. So if you have any questions, just put them in there and we could kind of... there will be some break points where I can open up the floor for questions as we work our way through, and then from there we'll see what we have. And just to confirm, I don't recall: do I have the full hour?
A
All right, golden. So this is GitLab debugging techniques, coming from a support engineering perspective. I joined GitLab around version 8.15. I've been pivotal in trying to grow our support engineering efforts, and further efforts toward what we're calling data-driven decision-making for all of our support. We want to base our decisions on data as much as possible, and these are some of the things that we've learned throughout this journey.
A
It was birthed by a customer's request, and that gave us this framework, and I think it's gonna be a great resource to share with other customers as well. The first thing that we're gonna start out with is thinking about common problem areas, and we're gonna walk through them specifically with GitLab the app. The first one we're gonna start with would be Unicorn errors.
A
Get
lab
in
its
current
form
is
huge
and
I
always
describe
get
lab
as
the
collection
of
software
and
services
that
give
you
the
get
lab
experience
right.
It
gives
you
that
end-to-end
experience,
it
is
not.
We
do
not
write
all
of
it.
We
glue
a
lot
of
it
together.
We
write
most
of
it,
but
a
lot
of
it
is
leveraging
open
source
and
other
commune
standard,
tooling
and
protocols.
A
This
diagram
comes
from
the
gitlab
architecture
overview
page.
If
you
haven't
been
there,
you
can
come
and
see
this
page
here
live.
This
is
an
incredible
resource
that
we're
trying
to
make
even
better
I
want
to
make
these
elements
clickable.
But
if
not,
if
you
come
to
the
sidebar,
you
can
click
on
nginx.
It
will
take
you
right
to
telling
you
about
where,
in
our
docks
and
things
you
could
find
it,
how
we
describe
it
in
to
its.
Is
it
a
core
component
or
monitoring
and
a
little
bit
of
description
about
what
that
does?
A
So,
if
you're,
ever,
if
a
customer's
ever
asking
questions-
or
you
have
questions
about
well,
what
the
heck
does
you
know,
Redis
do?
If
you
come
over
to
the
sidebar,
you
come
to
Redis,
you
can
see
exactly
what
Redis
does
and
Redis
is
stored.
We
store
session
data,
temporary
cache
info
and
background
job
cues
boom.
This
is
a
Bible,
so
just
jumping
into
itself.
A
When
I
talk
about
the
first
section
of
gitlab,
where
a
lot
of
errors
occur,
that
would
be
unicorn,
which
is
highlighted
with
this
arrow
here
now.
This
is
because
unicorn
does
so
much
heavy
lifting
unicorn
is
a
huge
component.
It
gives
you
the
UI.
It
is
handling
a
lot
of
database
interactions.
It
is
basically
the
heart
of
gitlab.
A
If
that
is
faltering,
you're
gonna
see
the
ripple
effect
happen
in
two
other
places.
I
also
mentioned
workhorse,
which
is
right
here.
Workhorse
sits
in
front
of
unicorn
that
was
in
technology
developed
in-house
by
us,
with
the
intention
that
when
unicorn
is
slow,
a
workhorse
should
be
fast.
So
we
put
it
in
front
of
unicorn
to
give
to
free
up
some
of
the
resources
when
unicorn
would
struggle
and
that
helps
us
move
faster
to
be
very
specific
about
what
workhorse
does
there
are
subsets
of
requests?
A
You
could
see
work
horse
going
straight
to
get
early
and
doing
some
other
routing
work
horse
can
skip
unicorn
completely
for
certain
types
of
requests
when
you're
doing
HTTP
cloning
and
whatnot,
so
that
you
don't
have
to
talk
to
the
rails
app
and
you
can
be
faster.
So
that's
where
unicorn
sits
in
and,
like
I
said
the
cheese
it's
gonna
do
this
to
me
today.
There
we
go
come
on
unicorn
itself
is
the
core:
the
heart
of
gitlab
I'm,
going
to
take
a
step
forward
and
talk
about
debugging
unicorn
and
a
few
bits
there.
A
Thank you, Mike, great question. That diagram is in progress, because there are a couple of different variants, and as our HA solution has evolved over time the diagrams look subtly different. I gave a response yesterday in Slack to Francis Potter; I'll link it into this notes doc, where I talked about that.
A
Basically
there's
a
couple
different
ways
to
distribute:
gitlab
H
a
and
set
it
up,
and
we
are
working
on
what
we're
trying
to
define
now
what
we're
calling
the
10k
reference
architecture
that
will
be
the
diagram
that
you'd
basically
want
to
be
looking
forward
to,
but
I'm
sure
there
are
a
bunch
of
other
customers
in
the
wild
that
have
subtly
different
variants
of
them.
So
10k
reference
architecture,
document,
I'll
link
to
my
answer
to
Francis
yesterday,
which
outlines
a
little
bit
more
and
I.
A
So Unicorn is an open source project that many, many, many Rails applications use. Workhorse was developed in-house; we have a blog post about it that was written a few years back, if you want to read more about the very specifics of Workhorse itself. If anybody knows that blog post, you can link it in the doc. And Christie, I'm going to encourage you to put your question in there as well, just so that we have it recorded.
A
So
people
can
comment
and
I'll
I'll
put
some
more
notes,
but
unicorn
urn
is
a
open
source
project.
I
do
want
to
take
that
moment
to
say:
unicorn
will
be
going
away
in
theory
and
you're
like
what
it's
the
heart
of
gitlab
we
will
be.
We
have
proposed
switching
to
another
technology,
called
puma
and
puma
versus
unicorn
have
subtle
differences
in
the
way
that
they
work.
Their
ultimate
goal
is
to
serve
the
rails,
application
to
serve
the
code
that
we've
written,
but
unicorn
and
Puma
do
that
differently.
A
I'll
describe
it
like
this,
and
then
we
could
talk
about
it
more
later.
Unicorn
uses
a
worker
model
and
Puma
uses
a
thread
based
model
to
handle
the
requests
which
have
different
trade-offs
and
different
things.
The
main
reason
we
would
want
to
switch
to
Puma
is
for
memory,
consumption,
I'm,
sure,
you've
seen
a
lot
of
people
say:
gitlab
uses
a
ton
of
memory
because
we
use
Ruby
because
we
use
unicorn
we're
trying
to
shift
that
a
bit
but
Kristi
I
hope
that
helps
any
other
questions
from
the
team.
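A toy sketch, not GitLab code, of the two serving models just described: a Unicorn-style preforked server runs N independent worker processes, each holding a full copy of the application, while a Puma-style threaded server shares one copy across N threads, which is why the switch is pitched as a memory win. The worker counts and the handle function here are made up for illustration.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def handle(request_id: int) -> str:
    # Stand-in for "serve the Rails application" work.
    return f"handled request {request_id}"

if __name__ == "__main__":
    # Unicorn-style: independent OS processes, memory duplicated per worker.
    with ProcessPoolExecutor(max_workers=4) as procs:
        print(list(procs.map(handle, range(8))))

    # Puma-style: threads inside one process, memory shared.
    with ThreadPoolExecutor(max_workers=4) as threads:
        print(list(threads.map(handle, range(8))))
```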
A
All
right,
I'm,
gonna,
move
on
to
the
next
section
here,
where
we
start
to
dive
into
debugging
unicorn
itself,
so
unicorn
will
log
its
errors
to
a
folder
or
a
file.
Excuse
me
code,
unicorn,
standard
error,
log
on
add
a
fault,
get
lab,
install
that
will
be
var
log,
get
lab,
unicorn,
unicorn
standard
error
log,
if
they're
doing
anything,
crazy,
it'll
be
somewhere
else,
but
the
file
you're
looking
for
is
unicorn
standard
error.
Log
now
I
have
taken
the
two
types
of
errors
that
you'll
genu
generally
see
in
that
log.
A
Here,
there's
a
third
type
that
would
be
related
to
LDAP,
but
it's
more
of
just
a
warning,
not
an
error.
The
two
big
things
that
you
want
to
think
about:
you
have
timeouts
and
memory
killer
and
I'm
gonna
talk
to
them
and
how
they
work
a
little
bit.
So,
for
example,
we
have
this
here.
This
is
a
timeout
in
the
unicorn
standard
error
log.
We
see
that
worker
number
two
in
this
instance,
timed
out
four
seconds
is
greater
than
three
seconds,
so
it
killed
it.
A
For memory killer: the bytes is bigger than this amount of bytes, so we're going to kill this worker with this process ID, 5687. It's been alive for 645 seconds, we've killed it, and now we know that that worker was worker number three, and worker number three is ready. Now you're saying: okay, well, what does this mean? What is actually happening? Why is Unicorn killing things? What does that mean? So we're gonna take a second to talk about Unicorn and how it works.
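A minimal sketch, not part of the talk, of counting those two kinds of worker kills in unicorn_stderr.log. The path is the Omnibus default mentioned above; the two regexes are assumptions, so check them against the exact reap and memory-kill lines on your instance.

```python
import re
from collections import Counter

LOG = "/var/log/gitlab/unicorn/unicorn_stderr.log"

# Assumed message shapes: "worker=2 ... timeout (4s > 3s), killing"
# and a memory-killer line mentioning the byte threshold.
TIMEOUT = re.compile(r"worker=\d+.*timeout")
MEMKILL = re.compile(r"memory|bytes", re.IGNORECASE)

counts = Counter()
with open(LOG) as f:
    for line in f:
        if TIMEOUT.search(line):
            counts["timeout"] += 1
        elif MEMKILL.search(line):
            counts["memory_killer"] += 1

for kind, n in counts.most_common():
    print(f"{kind}: {n} worker kills")
```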
A
Unicorn
itself
will
spawn
worker
processes
and,
depending
on
how
many
CPUs
you
have
on
the
server,
we
recommend
adding
two
more
processes
than
the
amount
of
CPUs
you
have.
So
if
you
have
16
CPUs,
we
want
you
to
have
18
worker
processes.
What
does
that
mean?
That
means
there
are
18
instances
of
gitlab
the
rails
application
running
each
one
of
those
in
unicorn.
Each
unicorn
process
will
handle
web
requests
that
come
in.
So
as
a
request
comes
in
worker
number,
2
will
pick
it
up,
and
the
next
request
comes
in
worker
number.
A
3
is
gonna
pick
it
up.
Next
request
comes
in
worker
number,
4
is
gonna,
pick
it
up
and
they'll
go
on
and
on
and
on
and
add
Infinium
now
what
happens
is
depending
on
what
the
request
is
doing?
For
example,
if
you
want
to
load
an
mr
page
that
has
20,000
comments,
we
have
to
go
and
find
those
comments
in
the
DB
we
have
to
render
them.
A
We
have
to
do
all
of
this
stuff
if
that
code,
that
is
running
takes
longer
than
the
timeout
that
we
set
then
unicorn
will
kill,
that
request,
killed
exceed
me,
kill
that
worker,
which
will
end
up
with
that
request.
Failing
we've
seen
this
happen,
yeah
I'm
sure
you've
experienced
it.
What
that
does
is
allows
unicorn
workers
to
not
get
tied
up
forever.
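A quick sketch of that rule of thumb (CPU count plus two). In an Omnibus install the corresponding gitlab.rb setting should be unicorn['worker_processes'], though verify that key against your own config:

```python
import os

cpus = os.cpu_count() or 1
workers = cpus + 2  # rule of thumb from the talk: CPUs + 2
print(f"{cpus} CPUs -> unicorn['worker_processes'] = {workers}")
```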
A
Imagine
a
scenario
where,
if
you
have
18
workers,
that
means
at
any
given
time
you
can
be
handling
18
things.
If
18
people
try
and
load
that
page
with
20,000
em
ours,
and
we
don't
use
a
timeout,
it
will
just
fill
up
the
whole
server.
The
further
requests
will
fail.
They'll
queue
up,
the
service
will
be
degraded,
gitlab
is
still
there
and
running.
It's
just
now,
no
longer
available
to
the
people
who
want
to
access
it
after
those
18.
So
that's
where
timeouts
come
in
now.
A
Memory
killer
is
a
very
similar
type
of
thing
where
each
unicorn
process,
as
it
does
work
it
may
need
memory.
You
know
if
you're
loading
20,000
comments,
you're
grabbing
them
from
the
DB
you're
trying
to
render
you're
gonna
need
to
grab
some
memory.
Now
we
have
memory
killer
in
place
so
that
it
an
ad
system,
instability
and
you're
saying
well.
What
does
that
mean?
If
a
unicorn
process
needs
more
memory,
it
will
grab
it
after
a
certain
point.
A
The
system
might
not
have
more
memory,
then
you
have
a
process
in
Linux,
called
the
oom
killer
out
of
memory
killer
and
what
the
Unicorn
memory
killer
is
trying
to
be
more
aggressive
than
the
oom
killer,
because
what
happens
is
on
a
linux
based
system,
the
oom
killer
will
just
kill
what
it
thinks
is
best,
and
that
might
mean
a
service
that
we
really
don't
want
to
go
away.
So
we're
saying
we'll
self-police
Unicorn
will
try
and
keep
an
eye
out
now
with
that.
A
These
two
values
are
tweakable,
but
they
come
with
caveats,
for
example,
in
an
a
che
environment.
If
you
are
running
a
node
that
is
just
running
unicorn,
then
you
can
probably
raise
your
unicorn
memory
limits
pretty
high,
depending
on
what
else
is
running
on
that
box.
But
if
it's
just
unicorn,
you
can
raise
those
up
to
take
the
amount
amount
of
RAM
that
you
have
available,
because
nothing
else
is
going
to
be
competing
with
it.
A
That's
something
you
could
do
what
that
means,
though,
is
there
are
some
requests
that
could
balloon
and
use
a
ton
of
resources?
That's
okay!
If
you
have
the
resources
available
to
you,
if
you
don't,
then
you
want
to
make
sure
that
we
protect
it
same
with
timeouts.
You
can
raise
a
timeout,
but
what
that
means
is
for
those
requests
that
would
have
been
timeout
killed.
You
are
now
going
to
just
take
longer.
That's
a
bad
experience,
so
we
tell
our
customers.
A
Sometimes
during
debugging
we
will
temporarily
raise
timeouts,
see
what
we
can
do
and
then
try
and
reduce
them
back
down
to
the
defaults.
There
are
some
instances
where
you
may
raise
a
timeout.
It
will
generate
the
page
cache
the
data
that
it
needs.
You
can
lower
the
timeout
and
future
requests
we'll
be
fine.
Then
you
could
talk
to
support
engineering
and
things
like
that,
and
we
could
try
and
understand
why,
in
their
instance
that
we
need
more
time
right
was
it
lack
of
CPU
speed?
Was
it
they
didn't
have
enough
resources?
A
Was it some other thing that was just so slow, or was it that the code was not optimized as well as it should have been? So those are some of the ways and things that you'll see in unicorn_stderr.log. If you're raising timeouts, you want to do it temporarily; otherwise, be ready for requests to potentially take longer. And for memory killer: if you have a dedicated node running Unicorn, you can raise that up to use available resources, but if not, it's meant to protect the other resources so that Unicorn doesn't get too greedy.
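For reference, a sketch of checking which of these knobs are actually set on a box. The key names here (worker_timeout, worker_memory_limit_min, worker_memory_limit_max) are the Omnibus settings these caveats are about, but verify them against the commented defaults in your own /etc/gitlab/gitlab.rb:

```python
KEYS = ("worker_timeout", "worker_memory_limit_min", "worker_memory_limit_max")

with open("/etc/gitlab/gitlab.rb") as f:
    for line in f:
        stripped = line.strip()
        # Only lines that are actually set, not the commented-out defaults.
        if stripped.startswith("unicorn[") and any(k in stripped for k in KEYS):
            print(stripped)
```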
A
So I'm gonna pause for a second: any other questions or thoughts on here before I move on to talking about a few other things? I see a hand raised, go for it.
B
A
That depends on your setup. So, for example, there are ways you can get more advanced, where you can raise the timeout and then have Unicorn do a rolling restart. There are ways to do rolling restarts, in theory, but again it depends on the needs, depends on the client. In an HA install, if we were doing that and they have other nodes, then that restart of one node should be fine. But yes, to be very clear, a reconfigure in the basic case would require downtime.
B
A
So that's a possibility. If they're using Kubernetes, Kubernetes will do it all automatically, auto-magically. But yes, it depends on your goal, right? If your goal is that you want the entire timeout to be raised across the board, then you have to roll everything and restart. If you want to test, you can pick one node, target your request to that node, and then test from there. Awesome, got a thumbs up. Questions on this section? Any other thoughts before I move on? This will tie into things more deeply as we go on.
A
It makes sense, it'll kind of build on to it, but I want to just make sure if anybody has anything else... all right, not seeing any movement, so I'll go on to the next bit here. Now, this next slide, this next section, is Unicorn errors, and I have, in parentheses, Workhorse. Workhorse is separate from Unicorn, but they work in unison, and every request that goes to Unicorn... yes, every request should make its way through Workhorse. So this is a huge input zone where you can get a lot of information, right? It sits in front of Unicorn, and it will route some of those requests to Unicorn and some of them to other resources. What I want to focus on here is this debugging trick, and this came about recently; we figured this out.
A
I
went
on
a
spirit,
quest
to
try
and
add
into
unicorn
I
wanted
to
make
it
tell
me
what
the
request
unicorn
was
working
on
in
this
output,
it's
very
hard
to
do
near
impossible,
but
when
I
was
working
and
talking
with
Stan,
we
figured
out
that
actually
the
way
that
this
will
be
bubbled
up
or
you'll
see
it
is
in
the
work
horse.
Log
you'll
see
this
error
message:
bad
gateway,
so,
for
example,
for
this
request
here
failed
after
five
seconds
and
a
file
message
method
get
URI
/.
This
was
when
I
timed
out.
A
This is what I saw in Unicorn. So basically you can look at five to ten seconds, or however long your timeout is, maybe a minute or thirty seconds, before your reap message in the Unicorn error log, and you'll find out what the person was hitting. This is really valuable, because now you have potentially a reproducible thing. As we think about GitLab, it is a system, an ecosystem, an organism, and it could be under heavy load. For example, if I timed out because resources were constrained or something like that, and I retry this request in five minutes and it works, now I say that was slightly non-deterministic behavior. What was different? Okay, it's probably load. If I try this again in five minutes and it still fails, and I try it again in ten minutes and it still fails, now you're like: okay, I've done this three times, it fails every time, this looks like a bug or a problem that we need to get to the bottom of, not related to load. As you're working through and thinking about GitLab, those are the two thoughts a support engineer has.
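A sketch of that correlation trick, not from the talk: given the time of a worker reap from unicorn_stderr.log, print the badgateway lines Workhorse logged in the surrounding window, which carry the method and URI that was being hit. The log path is the Omnibus default, and the timestamp parsing is an assumption to adapt to your log format:

```python
import sys
from datetime import datetime, timedelta

WORKHORSE_LOG = "/var/log/gitlab/gitlab-workhorse/current"
WINDOW = timedelta(seconds=10)  # "five to ten seconds", or your timeout

# Usage: python3 find_uri.py "2019-06-11 14:23:05"  (time of the reap)
reap_at = datetime.strptime(sys.argv[1], "%Y-%m-%d %H:%M:%S")

with open(WORKHORSE_LOG) as f:
    for line in f:
        if "badgateway" not in line:
            continue
        try:
            # Assumed timestamp prefix; adjust the slice/format as needed.
            ts = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue
        if abs(ts - reap_at) <= WINDOW:
            print(line.rstrip())
```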
A
So
if
you
think
about
gitlab
as
a
stack
of
software,
you
have
nginx
in
the
front
behind
that
workhorse
and
then
you
have
unicorn
which
handles
rails,
so
nginx
workhorse
unicorn.
These
are
your
web
requests.
These
are
things
on
poor
84.
43
are
gonna.
Go
here,
the
other
type
of
requests
you
would
see
it
would
be
an
ssh
request
which
would
be
used
for
get
data
or
things
like
that.
So
what
I'm
saying
here
is,
for
example,
if
unicorn
has
a
timeout
or
a
memory
kill
that
process
goes
away.
A
It's
gonna
bubble
up
to
workhorse
and
workhorse
is
gonna.
See
that
end
of
file,
error,
workhorse,
is
gonna,
say
well
the
thing
you
tried
to
get.
We
had
an
error,
then
it's
gonna
bubble
up
to
nginx
and
nginx
will
say.
Well
my
the
person
behind
me
had
a
problem
and
they
told
this
is
what
they
told
me
and
they're
gonna
show
it
to
the
user.
So
in
the
the
stack
basically
unicorn
workhorse
nginx
is
where
you're
gonna
find,
where
the
errors
come
from
99%
of
the
time
it's
gonna
start
with
unicorn.
A
Very,
very,
very
rarely
do
we
see
a
problem
where
it's
like
workhorse
was
gone
because
workhorse
is
pretty
stable.
It
doesn't
do
much
it's
a
very
simple,
router
proxy.
It's
not
a
heavy
lifter.
It
just
acts
very
quickly
right
and
then
engine
exits
in
front
of
there
as
well
to
be
very
efficient
at
handling
and
queuing
requests
so
sometimes,
depending
on
your
errors,
you'll
see
it
generated
from
subtly
different
spots
in
that
stack,
but
those
are
the
hot
spots
right
now
notice.
A
In
my
discussion,
we've
focused
on
unicorn
because
it's
the
heart
and
it's
where
we
get
a
lot
of
problems.
A
lot
of
these
other
pieces
can
have
problems
but
they're,
usually
not
showstoppers,
right,
they're
subsystems
or
all
subprocesses
that
are
doing
things,
but
aren't
usually
going
to
take
the
whole
thing
down
these.
This
kind
of
stack
here
and
I'll
talk
a
little
bit
more
about
giddily.
A
So,
for
example,
if
you
get
a
502
in
get
lab-
and
you
probably
have
seen
this
at
various
points-
maybe
with
clients
or
the
the
classic
spot
that
you'll
see,
this
is,
if
you
restart
a
get
lab
server
and
go
to
the
you,
go
to
the
URL,
so
get
lab
testing
comm
or
wherever
your
URL
is
you're.
Probably
gonna
see
a
502.
If
you
just
do
it
real
fast.
Why
is
that?
Well,
what
happens
is
when
you're
bringing
up
get
lab
and
it
restarts
everything.
Nginx
is
super
fast
to
start
up,
it's
really
fast.
A
It's
really
efficient.
It
knows
what
it
needs
and
it's
it's.
It's
done.
It's
ready.
Workhorse
is
probably
up
next
and
pretty
quickly.
Well,
unicorn.
If
you
have
18
processes
that
need
to
load
probably
close
to
a
gig
each
of
resources
to
setup
that
takes
time.
So
when
you
see
that
502,
it
usually
means
unicorn
is
not
available.
Now
you
have
to
ask
yourself
well
why
well
like
I
said
if
you
generated
the
restart
well,
it's
restarting,
but
it's
also
possible
that
oom
killer.
A
Excuse
me
the
unicorn
killer
killed
it
or
it
had
a
timeout
or
something
along
those
lines.
That
could
be
a
reason.
Why
you're
seeing
a
502
right
and
that's
why
sometimes
some
people
will
see
502
s
and
others
won't
because
they're
requests,
the
worker
that
was
handling
their
request,
went
away
and
they're
gonna
get
a
502.
Now
there
is
a
very
unlikely
chance
where,
for
some
reason
you
have
completely
saturated
your
system.
Unicorns
are
100%
loaded
that
can't
handle
anything.
A
Nginx
is
sending
requests
through
workhorse
war,
of
course,
is
trying
to
get
to
unicorn
and
it
can't
and
it'll
generate
a
502,
but
that's
very
unlikely
in
the
wild.
That's
very
rare
that
I've
seen
that
happen,
usually
the
failure
mode
is
much
more
unicorn.
Workers
are
restarting
very
frequently.
Some
requests
are
getting
through
its
kind
of
slow,
but
you're,
not
gonna,
see
that
100%
saturation
it'll
just
be
very
constrained
and
you'll
see
a
lot
of
errors
in
the
log.
A
Now
this
one
503
gate
lab
itself
doesn't
throw
this
that
I've
seen
what
I
see
this
come
from
is
the
load
balancer
in
front
of
get
lab
is
down
or
load
balancers,
somehow
misconfigured.
This
error
message
right.
If
you're
running
a
distributed
system
or
some
other
type
of
server
in
front
of
gate
lab,
if
get
lab
as
a
whole,
the
entire
organism
is
gone.
Then
you're
gonna
see
that
503,
so
you'll
rarely
see
500
threes
unless
the
whole
service
is
down.
A
At
that
point,
it's
like
you
know
something
is
wrong
or
if
you
don't
and
you're
like
hey,
get
up
through
503,
then
it's
like
okay,
the
balancer
tried
to
talk
to
get
lab,
tried
to
talk
to
nginx
and
nginx.
Wasn't
there?
If
that's
the
case,
then
you
have
to
figure
out
why
it
thinks
nginx
isn't
there,
because
if
we
think
about
this
process,
stack
right,
nginx
to
workhorse,
unicorn,
giddily,
nginx,
extremely
stable,
rock-solid,
very
rarely
a
problem.
A
Unless
you
have
it
misconfigured
we're
course
very
rarely
a
problem
unless
it's
misconfigured
unicorn
heavy
lifting,
so
it's
breathing
heavy
it
can
be
exhausted
at
various
times.
That's
that's
possible,
so
those
are
kind
of
the
errors
in
in
the
last
one.
One
is
the
500
which
you've
probably
also
seen,
which
means
there's
a
logic
error
in
the
application.
These
are
very
easy
to
debug,
because
if
you
go
to
the
production
log,
you
could
see
the
error
message.
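A minimal sketch of pulling those 500 stack traces out of production.log. The path is the Omnibus default; how many lines of trace follow the marker varies, so the context size is a guess:

```python
LOG = "/var/log/gitlab/gitlab-rails/production.log"
CONTEXT = 25  # lines of trace to keep after each 500 marker

with open(LOG) as f:
    grab = 0
    for line in f:
        if "Completed 500" in line:
            print("=" * 60)
            grab = CONTEXT
        if grab:
            print(line.rstrip())
            grab -= 1
```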
F
All right, a question, and, all right, I'm in the doc. So one thing: you mentioned the full production log when there's this error. With relaying this back to the CES team, what would be the base sentence you would want us to share when, say, a customer is saying, "I have this 500 error"? Would we say, "Please share your full production log with support when you submit a ticket"?
A
Talk
to
that
a
little
bit
later,
okay,
but
if
the
customer
is
savvy,
I'm
gonna
take
a
moment.
I'll
share
this
theory.
I've
been
working
on
doing
some
conference
talks
later
and
I'm,
trying
to
like
come
up
with
this
theory
that
that
is
like
the
premise
of
support
engineering
at
get
lab,
where
I
think
there
are
two
types
of
customers,
you
have
a
customer
that
are
sharp
and
you
have
customers
that
are
scared
right.
Well,
let's
think
about
how
we
define
that
we
know
the
sharp
customers
they're
customers
that
they're
probably
doing
something
advanced.
A
They
seem
very
comfortable,
probably
with
the
command
line.
They
seem
very
comfortable
with
gitlab
as
the
ecosystem
right.
They
understand
the
stack
and
how
that
kind
of
fits
together.
Those
customers
are
bringing
new
ideas
right.
Then
you
have
the
scared
customers.
They
may
know
a
little
bit
about
get
lab.
They
may
know
about
what
nginx
is,
but
they
feel
scared
to
touch
it,
because
they're
scared,
if
they
touch
it,
they're
gonna,
break
it
right
and
those
customers
depending
on,
if
you
have
a
sharp
or
scared
customer,
is
probably
gonna
change.
A
The
answer
right
so
the
safe
answer
is,
you
can
ask
them:
hey,
give
us
the
production
log.
If
they're
in
an
H,
a
environment,
we
need
all
of
the
production
logs
right.
We
need
to
be
able
to
see
everything,
and
if
they
are
a
sharp
customer,
you
could
probably
pinpoint
and
say
hey
hit
that
page
generate
the
in
your
production
log.
A
At
that
time
you
should
see
an
error
message
paste
us
that
full
stack
trace
and
that
should
be
a
plenty
for
support
and
a
sharp
customer
should
know
exactly
what
that
means
and
how
to
do
that.
Now
with
that
theory
in
support
engineering,
we're
trying
to
figure
out
well,
how
do
we
turn
scared
customers
into
sharp
customers,
because
sharp
customers
are
fun
to
work
with
the
Scared
customers
are
hard
to
work
with,
because
they're
scared.
Anybody,
that's
scared,
is
hard
to
work
with.
So
that's
kind
of
Lee's
metatheory,
here
long
answer
to
an
easy
question.
C
A
This is an extremely important part of GitLab: Gitaly. It shares the name Git in the name Gitaly, so you can understand that this is probably doing something very important. If you're not familiar with Gitaly, we have a lot of information out there around Gitaly. You can go to the architecture page, you can click on Gitaly in the sidebar, you can see where our documentation is, our motivation, blog posts, stories, theories. Effectively, Gitaly will sometimes throw you this error: deadline exceeded. You'll find this in the Gitaly folder, in the current error directory...
A
Excuse me, file: /var/log/gitlab/gitaly/current is the file name, and you'll see a line that looks like this: "warning: health check failed". It will give you an RPC, it'll tell you "deadline exceeded", and then it'll tell you the worker that was working on it. Now you're like, what the heck is a deadline exceeded? This is another case of self-inflicted GitLab, trying to make sure that it doesn't destroy itself. It is saying: this took too long, I'm gonna stop working on that, and I'm gonna start doing other things now. For example, if you are working in GitLab and you go to that Gitaly current log and it's flooded with deadline exceeded, well, now you probably have a problem where you're down, where things are all breaking, those kinds of things. If that's happening, then the issue is we are struggling to get access to disk, right? It is taking too long to get to the disk, whether that's the network or the hard drives themselves.
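A sketch, not from the talk, of bucketing those warnings by hour to see whether the slow disk access is constant or tracks load. The path is the Omnibus default named above; the timestamp slicing assumes an ISO-style prefix, so adapt it to your log format:

```python
from collections import Counter

LOG = "/var/log/gitlab/gitaly/current"

per_hour = Counter()
with open(LOG) as f:
    for line in f:
        if "deadline exceeded" in line.lower():
            per_hour[line[:13]] += 1  # assumed "YYYY-MM-DDTHH" prefix

for hour, n in sorted(per_hour.items()):
    print(f"{hour}: {n} deadline exceeded")
```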
A
So what you've done is you've solved the problem: you're not gonna see deadline exceeded any more. But deadline exceeded is kind of a flag to say: hey, hey, hey, go see why your disks are slow, right? You don't want to say, "well, I have deadline exceeded, how do I get rid of this error?" That's not the right thing. The right thing to say is, "I have deadline exceeded..."
A
Why
is
this
slow
right
and
then,
when
you
understand
that
the
deadlines
exceeded
will
go
away
because
you've
you've
solved
the
problem
for
them
being
slow
and
I
just
wanna
pinpoint
their
those
three.
We
have
get
early
deadline
fast,
normal
and
I.
Think
we
call
it
long
or
slow
effectively.
We've
done
some
research,
and
we
said
some
git
commands
should
be
really
fast.
Some
take
a
little
more
time.
Some
take
a
lot
of
time
and
we've
structured
that
in
that
way,
so
that
we
can
fail
even
faster
so
that
gitlab
can
be
more
reliable.
A
It
seems
counterintuitive
and
a
lot
of
customers
that
aren't
used
to
thinking
about
software
at
scale
are
not
used
to
that
thinking
where
it's
like.
If
we
fail
a
specific
number
of
slow
requests,
we
have
may
have
better
uptime
and
it's
like
oh
yeah,
fascinating,
so
that's
deadlines
exceeded
that
can
bubble
up
right
if
giddily
is
slow,
that
can
potentially
cause
your
unicorn
process
to
be
slow
because
it's
waiting
on
giddily
and
that
could
bubble
up
bubble
a
bubble
up,
give
you
other
errors,
five-oh
2s
and
things
like
that,
possibly
right!
A
So
now,
we've
kind
of
worked
our
way,
all
the
way
to
the
back
of
the
stack
to
the
bottom
giddily
on
that
base
layer,
giving
you
deadlines
exceeded
now
with
that
giddily
is
tied
into
a
couple
of
other
systems
right
those
deadlines
exceeded.
Errors
may
show
up
in
other
logs
right.
You
can
see
it.
Deadlines
exceeded
sometimes
pop
up
in
a
unicorn
log
in
a
production,
dot
log
or
you
could
see
you
know,
get
early
time.
A
Deadlines
exceeded,
I've
seen
them
pop
up,
sometimes
in
CI
runner
logs
right,
because
the
runner
was
doing
some
things
and
then
deadlines
exceeded.
So
it
bubbled
up
the
error
message,
so
you
could
see
it
in
other
places,
but
if
you
see
that
deadline
exceeded
the
disk,
access
for
get
data
was
too
slow
as
a
temporary
measure
for
an
extreme
stopgap.
If
it's
a
everything's
on
fire,
we
just
need
to
get
this
one
release
out.
A
You
can
raise
those
timeouts
otherwise,
raising
those
timeouts
are
gonna
cause
massive
amounts
of
pain,
they're
gonna
cause
suffering
plight,
drought,
the
locusts,
will
come
like
all
of
it,
so
don't
don't,
dare
do
it
unless
you're
doing
it
for
five
minutes,
because
the
the
release
is
blocked
completely.
You
just
need
a
good
to
release
out
and
then
you're
gonna
go
and
immediately
figure
out
why
the
discs
are
slow.
So
that's
giddily.
A
I saw the chat bubble up, so I just want to make sure there's nothing important in there, if Zoom will render me the chat... I lost it. Cool, nothing that needs to be answered in the chat. We'll go on to Sidekiq. Notice we haven't talked about Sidekiq yet, but we will. I'm gonna pop back to the architecture diagram and point over to Sidekiq: what the heck is Sidekiq? Well, it hangs off to the side. Sidekiq handles a lot of background processing. This is designed so that GitLab requests can be even faster.
A
Imagine,
for
example,
if
you
had
to
render
the
mr
page
in
unicorn,
if
you're
doing
that,
if
you're
doing
that
synchronously
you
got
to
load
it
all
in,
do
all
this
blah.
If
you
take
some
of
that
and
say,
hey
I'm
gonna
do
some
of
this
asynchronously
and
report
back
now
you
can
go
and
you
could
throw
that
off
the
sidekick
and
then,
when
sidekicks
done
it
could
send
it
back
and
it's
not
gonna
waste
time.
So
that's
the
possibility
of
what
sidekick
gives
to
you.
A
I'm
gonna
come
back
here
it
logs
to
the
sidekick
folder
the
current
file
in
there.
So
when
a
sidekick
job
fails,
it
should
give
you
a
stack
trace.
It
should
say
failed
this
thing.
This
is
what
we
did.
This
is
what
happened
when
that
happens,
you
throw
us
that
stack
trace
and
support
can
start
to
debug.
If
you
are
enterprising,
you
can
go
and
see.
It
should
tell
you
what
worker
it
was
working
on
and,
for
example,
you
can
see
in
gitlab
how
many
workers
we
have
these
are
background.
A
These
are
all
of
the
things
that
we
background
process
and
get
left
right
at
any
given
time.
Gitlab
is
handling
all
of
these
things
in
the
background,
whether
on
a
schedule
or
ad-hoc,
depending
on
load,
it
is
doing
all
of
this.
So,
for
example,
if
you
had
a
new
issue
worker
fail,
it
would
be
new
issue
worker
fail
because
blah-
and
you
could
come
in
here-
and
you
would
say:
well,
it
probably
failed
on
one
of
these
three.
A
It
would
tell
you
you
know,
and
it
might
say
like
can't,
send
the
notification
cuz
the
DB
is
down
and
you're,
like.
Oh
wow,
okay,
there
we
go,
you
know,
but
you
can
start
to
check
out
what
what
sidekick
is
doing.
There's
one
last
thing:
I
want
to
share
about
sidekick
itself.
Sidekick
has
some
modes
where
you
might
see
failed
and
dead
process
jobs,
sidekick
calls
them
jobs
failed
or
dead.
A failed job will retry, by default, five times; depending on the job, that may be subtly adjusted.
A
If a job is dead, then that means it's done retrying; it's not gonna do it anymore. Now, what I want to say here is about the way that Sidekiq is set up and works. For example, if you do something, it does it in the background, and it fails, you're usually going to see it in the UI somehow: whatever you were trying to do is not gonna be there. So that raises the question: why did Sidekiq fail? If it was load related or whatnot, you should kind of do that retry loop. Retry: if it works, okay. Okay, interesting, so kind of load caused Sidekiq to not be able to process this. If you do it again and it fails, okay, now it seems like Sidekiq is probably having a logic error, so we need to go look in the logs, we need to go see what errors we find. So if we do that, that will give you some examples.
A
So failed jobs aren't the end of the world. Dead jobs are also not the end of the world, because if a job is dead and you, as the person, are not getting what you want, you're gonna keep trying, or you're gonna ask support, and we're gonna figure out why it died, right? So it's one of those things where it's not the end of the world: if you see a failed or dead job, usually you're gonna continue to do the thing, and we're gonna either fail again or figure out why it failed. So those are my musings on Sidekiq.
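A sketch, not from the talk, of counting Sidekiq failures by worker class to see whether one background job dominates. The path is the Omnibus default; the "fail" marker and the class-name regex are assumptions against your log format (newer versions log JSON, which is cleaner to parse):

```python
import re
from collections import Counter

LOG = "/var/log/gitlab/sidekiq/current"
CLASS = re.compile(r"(\w+Worker)")  # e.g. NewIssueWorker

fails = Counter()
with open(LOG) as f:
    for line in f:
        if "fail" in line.lower():
            match = CLASS.search(line)
            fails[match.group(1) if match else "unknown"] += 1

for worker, n in fails.most_common(10):
    print(f"{worker}: {n}")
```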
A
A bottleneck, right, is something that we want to manage, and depending on how you architect the setup for a GitLab install, we usually separate Unicorn and Sidekiq to let background jobs do their thing and let the front end be decoupled, so they're not fighting for resources. But beyond that, Sidekiq is relatively stable; I don't usually see it running into problems. But you can have failed jobs and things like that, where we have to tune our code and our logic. So: thoughts, questions, before I move on to the next section?
A
All
right,
we're
gonna
do
it.
So
this
goes
back
to
the
earlier
question.
Here's
a
scenario
when
the
customer
does
not
give
us
a
clear
problem
description.
This
is
the
worst.
This
kills
the
support.
This
is
not
fun
for
us,
so
when
you're
working
with
customers
feel
free
to
share
this
deck
with
them
feel
free
to
share
this
video
with
them.
Let
them
know
like
we
want
them
to
get
us
close
to
the
problem
as
possible.
A
Now
there
are
times
where
they
may
not
know,
and
that's
okay,
that's
what
we're
here
for,
but
the
more
that
they
can
give
us
pinpoint
or
the
more
clarity
the
faster
will
work.
We
also
want
to
understand
our
architecture
and
relevant
logs,
so
they
can
pass
them
over.
We
built
a
tool
called
gitlab
SOS
that
customers
can
use
if
they're
a
scared
customer
get
Lab,
SOS
is
gonna,
be
their
best
friend
get
Lab.
Sos
is
something
that
can
run
it
bundles.
A
It
up
gives
us
a
zip
file,
they
can
send
it
over
to
us
and
it
gives
us
everything
that
we
want
to
know
really
really
helpful.
Sharp
customers
use
it
as
well,
but
sometimes
a
sharp
customer
will
look
at
the
output
and
then
say
hey
in
my
SOS
output.
This
looks
interesting.
That's
awesome!
That's
what
we
want
more
of.
We
can't
expect
that
all
the
time,
but
the
more
customers
we
can
get
understanding
how
gitlab
works,
the
better
experience
they're
going
to
have,
because
it's
not
gonna,
be
a
giant
black
box.
A
It
should
be
a
giant
glass
box.
They
should
be
able
to
see
and
understand
what's
going
on
inside,
so
we
want
to
get
those
stacked
races
because
they're
super
valuable,
because
without
a
log
message
or
error
message
we're
just
in
the
dark.
For
example,
we
see
this
happen
all
the
time
we're
seeing
a
500
error
when
we
visit
this
repo
and
then
they
send
us
an
image
in
a
Google
Doc
and
not
even
a
word
doc
with
a
500
error
page.
This
gave
us
no
data
right.
We
know
nothing.
What
is
the
dream?
A
What
would
I
love
to
see?
We
are
seeing
a
500
error
when
we
visit
this
URL
give
us
the
URL
we've
looked
in
the
production
log.
We
found
the
trace
for
that
URL.
Here's
the
trace
now
when
we
have
that
we
can
immediately
get
to
work
and
say,
we've
seen
this
before.
Have
we
not
do
we
have
ideas?
Potentially,
if
we
know
the
problem
fix
it
solve
it
point
them
to
where
it
was
fixed,
and
this
could
be
a
OneTouch
resolution.
A
You
know
this
is
super
super
valuable
anybody
can
do
it
the
other
way
at
any
given
time.
You
know
it's
very
easy
to
do
this
and
when
you're
scared
and
in
a
rush,
but
we
encourage
customers
to
kind
of
take
that
30
minutes.
Take
that
3-step.
Try
right,
try
and
understand,
is
this
load
related?
Is
this
a
bug?
Have
you
found
the
error
message?
A
We
are
doing
as
much
as
we
can
to
surface
that
and
as
we
surface
that
that
makes
it
really
easy
for
us
to
the
bug
so
I'm
gonna
move
on
to
some
more
theoretical
questions.
This
was
my
scenarios
that
customers
will
often
run
into
that
will
cause
problems
and
how
to
think
about
them.
Unicorn
giddily
sidekick
unclear
problem
descriptions
are
the
top
four
things
that
support
deals
with
and
if
we
can
get
more
clarity
on
those
from
the
customer
we'll
be
able
to
help
them
much
much
faster.
A
Great. I could blow through these last three sections really, really fast, and probably have about five minutes left for questions. So: customers often, if they're sharp, say, "GitLab's running Postgres; I could just talk to Postgres and get stuff and do that." Dangerous, right? Very dangerous. We want to tell customers: try and use our API. If it doesn't exist, especially if you're talking to a TAM or our customer success in any way: hey, tell us what you need from the API.
A
If you try and edit the DB directly, I will haunt you forever, so tell them: don't do it. Because, for example, if you change something in the DB and GitLab doesn't know, GitLab often has a lot of cascading things: if a value gets changed and the app does it, it will do some other things. If you just change it in the DB, those other things may not trigger or fire, and that's a nightmare, which is oftentimes why support engineers will use the Rails console: to try and use as much of the app as possible, to trigger those auxiliary effects, if you will. But we don't want customers in the Rails console by themselves, unless they're sharp and they have a subset of commands, so they know what they're doing, or we told them: hey, run this command, it's just read-only; or hey, run this command, it's dangerous, we've vetted it, know the dangers, but you need to run this. And so that's my thinking about getting data out of GitLab.
A
Moving on into integrations: customers often integrate with GitLab, and one of the biggest things that I think about is that the integration is there, and 85% of the time it works, but the problem is the person who set it up isn't there anymore. They don't know how it works; they don't know how it's documented. We have about three subtly different ways to integrate JIRA, and other tools like Jenkins and things like that. Depending on how that's set up, you want to make sure that the customer knows how their integration is working and what it provides.
A
The other bit is they'll upgrade the integration without thinking about the implications for GitLab, and then, if there's a version mismatch or some feature disparity that causes problems, it's not something that we did; it's just that we need to work with you and understand. Don't willy-nilly upgrade. And if you want to think about integrations and doing that, your TAM is probably the best spot to start.
A
So
we
would
think
about
scalability,
and
we
had
some
questions
that
I
layer
about
a
che.
When
we
think
about
this
architecture,
you
can
divvy
it
up
many
many
many
different
ways,
but
the
things
that
you
kind
of
want
to
think
about
as
you're
doing
that
when
it
comes
to
scale
performance
and
stability.
These
kind
of
three-pronged
triangle
right,
the
number
one
thing
that
is
gonna,
give
you
get
lab
performance
because
we're
using
git
and
we
need
to
access,
get
data.
It's
gonna
be
disk
speed.
A
A
You want your Git layer to be blazing fast; the faster you can get it, the speedier you can do that, the best performance you're gonna get bubbling up. Then from there, when you're thinking about the Rails app, you want a fair bit of CPU, spread out, so that you can get those concurrent requests in, right? When you're thinking about pure performance, again, it's the same thing: the fastest disk and the biggest number of CPUs you can throw in there and that you can afford; the wider you can go, the more performance it will let you get. Again, the constraints are gonna dictate what they get.
A
We in support just built this new tool we're calling fast-stats. We're trying to benchmark, and we're gonna be working with customer success on this: every version from here forward, we're gonna have a benchmark from GitLab.com, and we want to compare customers to GitLab.com. We hope that they're within 10% of GitLab.com speeds, right, and where they differ, then we can pinpoint and understand more about their environment: oh, why is this section slow? What are the differences? Data-driven decision-making is where we're trying to get to. Our customers are resonating with this, it's working really well, so we're trying to deliver on that.
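A sketch of the kind of comparison fast-stats makes, conceptually: pull request durations out of production_json.log and print a few percentiles you could compare between a self-managed instance and a GitLab.com baseline. The field name has changed across versions ("duration" in milliseconds vs. "duration_s" in seconds), so verify against your own log before trusting the numbers:

```python
import json
import statistics

LOG = "/var/log/gitlab/gitlab-rails/production_json.log"

durations = []
with open(LOG) as f:
    for line in f:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        d = event.get("duration_s", event.get("duration"))
        if isinstance(d, (int, float)):
            durations.append(float(d))

durations.sort()
if durations:
    print(f"requests: {len(durations)}")
    print(f"median:   {statistics.median(durations):.3f}")
    print(f"p95:      {durations[int(len(durations) * 0.95)]:.3f}")
```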
A
And then, when it comes to stability: if we think about this graph, stability is really dictated by resource contention, right? If everything is isolated from everything else, you've resolved that, and if everything has enough resources, then you will be stable, right? If you have stable ground and you're not fighting with anybody...
A
There's
stability
by
default,
where
things
start
to
shift
to
depend
on
on
volume
and
things
like
that.
But
again,
if
your
disc
is
not
consistent
and
erratic
or
your
CPU
is
constantly
under
load
from
various
scanners
or
other
software
doing
things,
that's
gonna
dictate
stability
and
that
and
if
you
have
load
that
shifts
or
ebbs
or
flows
or
things
like
that,
that
can
change
your
stability.
A
We
need
to
think
about
this
as
one
element
doing
that
all
together,
because
it's
you
know
you
got
to
pull
it
in
subtly
different
ways
to
figure
out
what
you
want
or
need,
and
the
last
thing
I
want
to
share
some
some
relevant
dashboards
from
gitlab
comm
customers
are
always
trying
to
understand.
What
are
we
doing
on
comm
the
biggest
thing
that
I
want
to
tell
customers?
Is
you
maybe
get
lap
comm
a
lot
of
our
customers?
A
Don't
have
the
team
or
the
scale
or
the
need
to
be
gitlab
comm,
so
the
thinking
and
the
things
that
we're
doing
there
might
not
make
sense.
For
example,
we
have
these
rules
for
thresholds
for
alerting
for
Galib
comm.
Those
thresholds
may
be
insanely
high
for
one
instance,
where
a
customer
has
you
know
at
less
than
a
hundredth
of
our
volume.
You
know.
So
it's
one
of
these
things,
we're
keeping
that
in
mind
where
there
is
a
lot
to
learn,
and
there
are
a
lot
of
things
that
you
can
see
what
we're
paying
attention
to.
A
But
you
also
don't
want
to
make
this
the
end-all
be-all.
We're
gonna
use
it
for
performance
right
because
we
want
to
make
it
lab
calm
the
pinnacle
of
performance,
but
from
there
customers
may
not
need
all
of
the
things
that
we're
monitoring
on
get
lab,
comm
or
whatnot,
and
it
doesn't
make
sense
to
try
and
model
their
infra
after
it,
but
as
long
as
they're
in
froze
as
fast
as
it
then
their
experiences
as
good
as
we
can
promise
so
questions
Thank.
A
So
it
depends
right.
My
understanding
right
now
is
we
have,
depending
on
the
customer,
if
they
don't
have
grifone
are
bundled
in
and
that
startup
experience
I,
don't
know
exactly
what
it
looks
like
now.
But
what
I
do
know
is
that
we
do
have
this
export
of
our
graph
on
our
dashboards
that
they
can
grab,
and
then
we
have
what
we
call
the
omnibus
dashboards
which
aren't
linked
here,
that
they
should
be
able
to
directly
import
and
should
just
work.
A
If
you're
finding
that
that's
not
happening,
then
we
need
to
work
with
the
monitoring
group,
because
that's
that's
our
vision,
but
going
forward
I,
think
from
probably
I'm
gonna.
Imagine,
12.2
or
something
core
fana
out-of-the-box
should
be
pretty
robust
and
and
make
it
really
easy
graph
on
our
Prometheus.
All
of
that
should
be
really
straightforward
for
customers.
So
does
that.
C
A
So support is really trying to drive forward with using the graphing, the Grafana, the data-driven approach, because, for example, a customer will say "GitLab is slow." GitLab is 31-plus components: what is slow? Where is the slowness? What are you experiencing? We need to pinpoint and understand how, where, why, what, you know. So: other questions, other thoughts, team?
G
Let's see, can we link to the omnibus Grafana dashboards? Lee mentioned some documentation for that. Let's make sure we include that in our, you know... yeah.
A
Cool, everyone. If that's the case, enjoy your Tuesday. Thank you, thank you, thank you so much for your time. I hope you had a good one, and if you have any other questions, feel free to come into the support team chat or the support managers channel in Slack, and we're happy to answer whatever we can. We love working and partnering with you all, because you're great and we want to make sure our customers get what they need. Awesome.