Description
2022 version of a video originally made by Lee Matos.
Link to presentation: https://docs.google.com/presentation/d/10SpbXwBy5f_zQ42RJexOuquRQ7CcsFcL4dKwev6ErIc/edit?usp=sharing
A
This is a GitLab debugging techniques video: GitLab debugging techniques from a support engineering perspective. Before we get going, Justin, can you introduce yourself?
B
Sure, my name is Justin. I'm a support engineer in Auckland, New Zealand, and I've been here for coming up on two years now. As for claims to fame, the ones I could think of about myself are that I'm probably the least young GitLab support engineer on the team, and that I'm quite easily confused, so it's great to work through these debugging techniques repeatedly for me.
A
So this video is about GitLab debugging techniques. It's originally focused on, or targeted at, the Support Engineering group; those are the people we have in mind and who we're trying to make this presentation for, but it's for everyone. So we're going to use terms like "customers" and point to specific channels within Slack that we'd use as a support group, but they're open to anybody within GitLab.
A
If
you're
viewing
this
presentation
outside
of
the
get
like
organization
I'm,
sorry,
you
won't
be
able
to
view
some
of
these
internal
channels
or
discussions,
but
they're
there
for
us
at
least
I
want
to
mention
that
this
is
kind
of
version.
Two
of
a
video
that
Lee
had
made
many
years
ago
in
this
slideshow
was
an
original
video
and
original
deck
link
and
I'll.
Try
to
make
this
presentation
accessible
outside
of
the
GitHub
organization,
and
then
you
can
view
it
later.
A
The reason why I felt we needed to remake this was because the original video was showing its age. It had a lot of great information and tips on things to look out for, but a lot of those things are not really useful anymore; there have just been a lot of changes since the last one. The last presentation didn't exclude things exactly, it just didn't have a lot of the information. It was made around version 11, and we have a lot of new things now.
A
So we'll start with the common problem areas and I'll walk through them. In the original presentation Lee had this big chart with the GitLab architecture, and this is available in our docs. I always say check the docs; it's part of our troubleshooting too, just check the docs, it's how I always reference things. So in the docs is this architecture chart. In Lee's original video he'd reference this chart and point to various portions of it as he went through each section.
A
Don't
really
want
to
go
through
this
chart
again
because
it
became
massive,
it's
huge
now.
So,
if
we
zoom
in
this
little
section,
this
Puma
section
we'll
see
that
it's
just
this
tiny
piece
in
this
massive
puzzle
and
gitlab
is
just
huge
now.
So
this
I
want
to
mention
this
specifically
because
we
won't
be
able
to
walk
through
everything.
A
We won't have every single problem addressed; there are so many different facets and different pieces of GitLab that we won't be able to do everything, but we can go through some of the common ones, and many of them are still similar to the ones in the original video. But we won't be doing that architecture breakdown anymore. Waming actually has another video about the architecture that you can reference and use as part of our learning.
A
If
we're
using
this
as
a
learning
presentation,
so
we
won't
be
doing
the
architecture
breaks
anymore,
but
it's
very
important
that
you
take
a
look
at
this
documentation
and
use
this
information
when
you're
debugging
gitlab,
because,
for
example,
there's
a
certain
component,
that's
not
working
as
expected.
You
need
to.
You
can
use
this
to
check
to
see
what
what
might
be
the
cause
like.
A
What
component
is
connected
to
what
so
we'll
start
with
the
scenario,
for
example,
Puma
errors,
so
Puma
is
kind
of
the
core
app
rails.
It
is
things
that
it
controls
the
core
app
of
gitlab
they're
in
troubleshooting
this
they're
four
logs.
We
should
pay
attention
to
and
these
are
available
in
our
Docs,
but
we
almost
always
have
to
look
at
these
types
of
logs
whenever
we're
troubleshooting,
almost
anything
gitlab.
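For reference, on an Omnibus install you can follow these logs live while reproducing a problem; a minimal sketch (the service name argument is optional):

    # follow every GitLab log on the node
    sudo gitlab-ctl tail

    # or just the Rails application logs
    sudo gitlab-ctl tail gitlab-rails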
A
So
whenever
we
request,
information
from
customer
will
almost
always
ask
for
these
logs
unless
we
know
exactly
what's
going
on,
but
there's
a
little
asterisk
in
there.
So,
for
example,
we'll
see
this
Puma
memory
killer,
we'll
see
in
the
Puma
I
see
the
outlog.
This
is
a
very
common
error
and
there's
a
way
to
fix
this.
Within
the
logs,
and
generally
just
to
increase
the
memory,
if
this
happens
too
frequently
or
if
it
happens
consecutively
through
many
many
workers,
then
this
is
a
problem.
A
It
should
be
addressed
and
we
can
take
a
look
at
the
docs
to
see
how
to
increase
it.
This
is
something
that
has
popped
up
again
over
the
years.
This
is
something
that
was
a
very
very
common
a
couple
of
years
ago,
kind
of
fell
off,
because
our
memory
usage
was
much
better
but
started
popping
up
again
more
recently.
In
my
opinion,
this
is
just
something
that
kind
of
pops
up
what's
old
is
New.
Again
is
something
I
think
about
when
I
see
the
Puma
memory
killer,
errors.
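As a rough illustration of the kind of change the docs describe, the Puma worker count and per-worker memory limit live in /etc/gitlab/gitlab.rb on Omnibus installs. The setting names below are the ones I'd look up first; confirm them, and sensible values, against the documentation for the customer's version before suggesting anything:

    # /etc/gitlab/gitlab.rb -- example values only, not a recommendation
    puma['worker_processes'] = 4
    puma['per_worker_max_memory_mb'] = 1200   # threshold the memory killer enforces per worker

    # then apply the change:
    # sudo gitlab-ctl reconfigure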
A
Next, 500 errors: something failed, essentially the thing that you expected to work failed. When the Puma app was trying to retrieve some data for the app itself, it failed to retrieve it, either through timeouts, or the data didn't match what was expected in the database, or it just didn't exist. This is a common error, and it's something we can see in the logs as a 500.
A
Additionally,
when
you
see
a
500
error
in
the
UI
there's
always
a
correlation,
ID
and
I'll
talk
about
this
a
lot
a
lot
later,
especially
about
correlation
ID
and
logs
another
one
is
deadline
exceeded.
This
was
a
common
error
quite
a
few
years
ago,
and
mostly
attributed
to
giddly
being
slow
because
of
disk
problems.
Either
the
disk
performance
was
not
keeping
up
with
what
Kidley
demanded
of
it
or
giddly
was
just
requesting
too
much.
A
I
I
mean
this
essentially
the
same
thing
worded
twice,
but
nowadays
it's
more
of
a
red
herring
error.
We'll
see
this
a
lot
in
the
logs
just
move
past
this
when
you're.
If
you
see
lots
of
these
in
the
logs,
it
could
be
a
problem.
But
if
you
see
this
once
in
a
while,
I
wouldn't
rely
on
this
to
tell
you
whether
or
not
something
is
correct,
and
if
you
look
at
the
timestamps,
you
can
see
that
this
is
from
Lee's
original
presentation
from
2019.
A
So
this
is
a
very
old
error,
but
it's
kind
of
new
again
it's
popping
up,
but
look
Beyond
this
one,
because
it's
not
always
disk
performance
error
nowadays.
A
So
lee
has
this
really
great
concept
about
scared,
customers
and
Savvy
customers.
Scared
customers
are
those
who
request
a
problem
or
tell
us
about
a
problem
that
they're
experiencing
and
they
don't
always
include
the
full
problem
description.
For
example,
we
were
seeing
a
500
error
when
we
visit
a
repo
and
all
repo
is
not
very
clear.
Is
it
all
repos?
Is
it
one
repo?
Is
it
a
group?
Is
it?
Is
it
pushing
or
pulling
it's,
not
very
specific
or
users
can't
see
stuff.
A
Please
help
was
an
actual
error
we
received
just
a
few
days
ago
in
an
emergency
ticket.
So
these
are
scared,
customers
that
don't
fully
understand
the
problem
that
they're
experiencing
and
we
want
to
make
them
more
Savvy
customers.
So
what
we
do
with
an
unclear
problem
is
we
start
with
a
problem
description?
What
is
the
user
describing?
Were
they
expecting
to
see
what
and
where
is
the
error
occurring,
so
they
have
500
errors
or
503s?
What
are
they
experiencing?
We
get
a
gitlab
Osos,
which
is
a
set
of
logs.
A
This
is
a
really
great
project
and
I'm
going
to
mention
a
few
times.
It's
going
to
be
mentioned
quite
a
few
times
in
this
presentation,
because
it's
very,
very
good
output
providing
a
full
trace
and
correlation
ID.
You
can
get
the
full
Trace,
either
from
well
from
the
logs,
if
you're
getting
the
the
full
Trace,
it's
definitely
from
the
logs
correlation
ID
can
be
presented
in
the
UI
or
the
logs.
A
If they don't have this information, it just takes a long time to debug; we just don't know what it is. It's really hard to troubleshoot a problem when you don't know what the problem is. It's like when you take your car to the mechanic and it just doesn't make that sound anymore: you don't know how to fix the problem, because you don't know where the sound is coming from, and the mechanic doesn't know how to recreate it.
A
It
just
really
difficult
to
do,
and
also
we
have
to
define
a
common
language
since
there's
terms
defining
a
common
language
is
important
too,
and
I
really
want
to
mention
this
one,
because
it's
not
about
English,
it's
not
about
getting
the
same
language
like
spoken
language.
It's
more
about
terms!
If
someone
uses
a
phrase
about
their
Runners,
not
working
expected,
they
they
need
to
be
clear.
Is
this
like
a
gitlab
runner,
that's
not
running,
or
is
this
something
in
the
pipeline?
A
That's
not
running
on
the
runner,
defining
a
Common
Language
and
where
things
occur
really
matter,
and
we
need
to
give
the
customers
from
that
scared
state
to
a
Savvy
State,
and
this
is
the
way
these
are
ways
we
can
do
it.
So,
for
example,
a
good
problem
description
is,
we
are
seeing
a
500
error
when
we
visit
this
repo,
so
sometimes
they'll
have
an
image
link
to
a
500
error,
but
if
it
doesn't
include
that
correlation
ID,
it's
not
very
useful.
It's
just
that.
A
We
know
that
there's
a
500
error
happening
if
they
include
a
timestamp,
that's
better.
So
if
they
include
this
correlation
ID,
it's
good
if
they
paste
it
in
the
tickets,
even
better.
If
they
have
accompanying
logs
included
with
the
ticket
when
they
they
submit
the
ticket,
that's
that's
great
and
if
they
Pace
the
entire
stack
Trace,
that's
a
great
way
to
start
off
a
ticket
to
debug
a
problem.
We
can't
debug
a
problem
unless
we
know
what
the
problem
is
even
working
towards
that
problem.
A
We
can't
find
good
solutions
to
work
towards
that
problem
unless
we
know
where
to
look
and
if
a
scared
customer
does
isn't
able
to
tell
us
it
just
delays
everything.
So
we
need
to
make
the
customers
into
more
Savvy
customers,
so
troubleshooting
the
problem
or
debugging
the
problem.
We
have
several
ways
of
doing
it.
A
In
the
past,
it
was
always
recommended
to
do
things
like
check
a
database
or
the
rails
console
or
API
and
which
one
should
we
use
well
API,
if,
if
you
can,
for
example,
if
they're
trying
to
query
a
bunch
of
Mrs
and
they're,
just
not
displaying
on
the
page,
use
the
API
to
see
if
you
can
query
those
Mr
Mrs
during
the
normally
and
if
they
return
in
the
time
that
you
expect.
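As a quick sketch of that kind of check (the host, token, and project ID below are placeholders; the endpoint is the standard merge requests API), you can time a request for the same data the page would load:

    # placeholders: substitute the real host, a token with read_api scope, and the project ID
    time curl --silent --header "PRIVATE-TOKEN: <your_token>" \
      "https://gitlab.example.com/api/v4/projects/<project_id>/merge_requests?state=opened&per_page=20" \
      | head -c 300

If the API answers quickly with the expected records, the blank page is more likely a frontend or view problem than a problem with the data itself.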
A
So if the MRs page is just blank, or stuck showing a bit of a loading screen, and you use the API and the data shows up fine, maybe there's something else happening, maybe a UI or UX issue or something. We also want to choose between the Rails console and Postgres, and I recommend going with the one you're most comfortable with, because you don't want to make mistakes. That doesn't mean be afraid; if you're not comfortable with either of them, or even just one or the other (and sometimes you need to use both), make sure you get someone to help.
A
Sometimes
you
need
to
use
both
make
sure
that
you
get
someone
to
help
someone
that
might
understand
a
little
bit
better
or
just
run
run
it
by
someone.
It
really
depends
on
the
circumstances
and
I
I
just
want
to
make
sure
that
to
start,
if
you're
with
a
scared
customer,
you
don't
want
to
start
giving
them
a
bunch
of
commands
to
start
pasting
the
chat
and
or
start
pasting
into
their
console
and
then
just
start
running
without
them
understanding
it.
A
So,
for
example,
for
example,
rails
console
move
to
the
browse
console
because
I'm
most
comfortable
with
using
the
rails,
console
and
I
think
a
lot
of
us
in
sport.
Engineering
are
the
original
presentation
that
Lee
had
made
had
talked
about
using
sorry.
My
screen
is
jumping.
A
So
the
original
presentation
that
Lee
had
made
talked
about
the
rails
console
being
kind
of
a
dangerous
place,
but
the
rails
console
has
improved
lately
and
we
have
we've
really
gotten
it
so
that
you
can
stay
kind
of
within
rails.
You
don't
go
off
the
rails
with
the
active
record,
especially
it's
hard
to
break
things.
A
So,
whenever
you're
talking
to
a
customer
and
trying
to
work
through
a
problem,
make
sure
you
test
the
commands
ensure
that
the
customer
is
not
copying
and
pasting,
all
of
them
at
the
same
time
go
line
by
line
and
paste
them
have
someone
to
look
over
your
commands.
A
The
reason
why
I
mentioned
the
line
by
line
thing
is
because,
if
you
paste
it
all
in
one
big
glob,
sometimes
it
could
throw
could
have
formatting
errors
that
you
didn't
expect
like
Carriage
returns
in
the
wrong
spot,
or
something
is
happening
too
quickly,
like
a
a
read
command
that
is
being
posted.
We
need
to
verify
that
the
the
things
that
we're
pasting
in
the
command
are
the
things
we're
expecting
and
I.
A
So, for example, with Rails ActiveRecord, when you're looking at a model (say the model behind an MR, if you're pushing or displaying an MR and looking at how it's shown in the UI), you have sets of data that you can look at within the Rails console. If you're viewing a certain set of data in the Rails console and it just doesn't display, or it errors, that's a good way of telling where the 500 error happens.
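As a small sketch of what that looks like in practice (the project path, MR iid, and the diff_refs call below are illustrative placeholders; substitute whatever object and method the stack trace points at):

    sudo gitlab-rails console
    # inside the console:
    project = Project.find_by_full_path('group/project')   # placeholder path
    mr = project.merge_requests.find_by(iid: 123)          # placeholder iid
    mr.attributes.slice('id', 'state', 'title')            # sanity-check the record itself
    mr.diff_refs                                           # example method; if it raises, you've found the failing piece

If the record loads fine but one particular association or method raises, that usually lines up with the failing line you'll see in the stack trace in the logs.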
B
Sure. So when I started, I knew my way around Postgres reasonably well, and my way around Rails not at all, so Postgres was the one that I started off using to look at things.
B
I'd use it to look at this or that in the database. But I do totally agree with Matthew that if you can get the hang of the Rails console, for a lot of applications, for debugging and testing things, it is the way to go, because it has all the logic built in there as to what GitLab does and how the various tables are grouped together into relationships. But the Postgres console does come in handy, just for checking things.
B
Sometimes
what
is
in
a
was
in
a
table,
or
else
what
the
schema
of
a
table
is
to
make
sure
it's
got
all
the
columns
and
indexes
and
things
that
at
once
so
I
mean
you
can
make
updates
to
data
and
gitlab
through
both
the
rails
and
the
postgres
console.
But
the
postgres
console
will
allow
you
to
make
changes
in
isolation
from
other
things.
So
you
could
easily
wind
up
updating,
merge,
requests,
but
not
updating
some
other
record
that
needs
to
be
kept
in
sync
with
or
consistent
with
it.
B
So
I
do
suggest
you
don't
use
postgres
for
updating
data
deleting
inserting
is
just
for
looking
at
tables
and
schemas
unless
you
absolutely
have
to
and
when
you
do
have
to
up,
and
there
will
be
a
know
and
work
around
for
a
problem
and
it
will
have
been
tested
and
tried
by
other
people.
So
you
can
be
reasonably
confident
that
it
will
work
but
always
be
aware
of
what
you're
about
to
update.
B
If
you
do,
the
BET
make
sure
you
verify
it
for
a
select
the
same,
we're
closed
so
before
you
go
ahead
and
do
an
update
or
a
delete
or
something
like
that
to
get
into
the
database
console
from
Standalone,
Omnibus,
installation
and
or
Docker
installation,
you
can
just
run
get
web
Dash,
psql
and
it'll
connect
to
the
local
database.
B
If
your
database
is
hosted
externally
that
won't
work,
so
you
can
use
different
rails.
Db
console
minus
minus
database
main
instead.
That
will
ask
you
for
the
password
for
the
configured
good
lab
user
as
per
the
gitlab.rb
file.
So
you'll
need
to
know
what
that
is
to
get
into
the
console
and
really
Danny
took
my
head
to
that
for
using
for
using
the
console
as
the
backslash
X
option,
which
will
change
the
output
formatting
sort
of
vertically
rather
than
horizontally.
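Putting those pieces together as a short sketch (the table, column and id are placeholders; the commands are the ones just mentioned):

    # bundled database on an Omnibus or Docker install:
    sudo gitlab-psql

    # externally hosted database, via Rails (prompts for the gitlab user's password):
    sudo gitlab-rails dbconsole --database main

    -- inside psql: switch to vertical output, then verify with a SELECT
    -- using the same WHERE clause before any UPDATE or DELETE
    \x
    SELECT id, title FROM merge_requests WHERE id = 1234;   -- placeholder id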
A
Right, that's a good one. So one thing, too, that I didn't mention (I think it was on the other slide): here I wrote that migrations are direct database changes, and when you look at the migration files you will see that they are database operations, an insert into a table, or creating a new table, or something along those lines. Now, it's not Postgres as such, but migrations: if you have a failed migration, the Postgres console is a great way to troubleshoot it. So I'll move on to logs. Logs are very important.
A
I
listed
it
quite
a
few
times
in
the
presentation,
because
it's
it's
very,
very
important
now
the
slide
show
is
only
one
single
slide
about
logs,
but
it's
probably
the
most
important
thing,
because
we
really
need
the
logs
to
trigger
try
to
figure
out
what
a
problem
is
to
see
what
the
issue
is.
We
really
need
to
find
out
the
logs.
A
We
can
get
information
from
a
500
error
and
we
can
try
to
get
some
information
from
the
findings.
The
logs
are
most
important.
So
if
you
check
the
docs
I
included
a
link
in
this
presentation,
the
docs,
the
docs
check
the
docs,
because
it
includes
a
list
of
all
the
logs.
We
have
and
examples
which
are
really
great
because
you
can
see
what
an
expected
log
out
was
supposed
to
look
like
it's
the
expected
log
output
and
then
the
correlation
ID,
which
I
mentioned
several
times
so
far.
A
If
you
see
this
within
the
browser,
you
can
get
that
correlation
ID
from
the
user
and
you
can
search
the
log
support.
This
is
important
on
SAS
too,
because
when
a
SAS
user
has
an
issue
and
they
they
give
you
that
correlation
ID,
you
can
go
into
our
log
system
and
look
it
up
when
you
you
have
the
correlation.
Id
you'll,
see
it
across
multiple
Services
too.
So
you
can
see
whether
Puma
was
behaving
as
expected.
A
If
giddly
was
behaving
behaving
as
expected,
correlation
ID
is
very,
very
important
and
you
can
see
that
in
the
API
too,
if
you're,
using
that
with
the
X
request,
ID
header,
it's
it's
really
really
good.
I
use
this
a
lot
too,
when
I'm
searching
through
logs
and
I
see
a
500
error
occurred.
One
place
that
I
want
to
see
if
this
is
attributed
to
the
same
problem,
that
the
user
is
reporting
or
is
it
just
another?
500
error
that
they
happen
to
be
experiencing:
are
they
even
related?
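A small sketch of both sides of that (host, token, and log path are placeholders or Omnibus defaults; the x-request-id response header is where the correlation ID shows up on API calls):

    # grab the correlation ID for an API request (placeholder host and token)
    curl --silent --dump-header - --output /dev/null \
      --header "PRIVATE-TOKEN: <your_token>" \
      "https://gitlab.example.com/api/v4/projects" | grep -i x-request-id

    # then search the Rails logs on the instance for that ID (default Omnibus path)
    sudo grep '"correlation_id":"<the-id>"' /var/log/gitlab/gitlab-rails/production_json.log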
A
So
it's
really
really
good
to
get
this
correlation
ID
when
you
can,
when
the
user
can
provide
it
so
stack,
traces
and
back
traces
are
great
because
they
can
show
you
where
the
code
failed,
and
this
is
really
cool
working
at
gitlab,
because
we
can
just
kind
of
Link
directly
to
where
the
code
failed.
So
you
can
go
on
gitlab.com
and
then
find
the
version
of
the
user
is
using.
A
You
can
find
that
where
the
stack
Trace
fails-
and
you
can
just
point
to
this-
this
section
of
the
code
is
where
you're
having
a
problem,
and
this
is
useful
too,
because
you
can
use
that
that
section
of
code
to
find
the
model
and
go
back
to
the
rails
console
and
then
just
check
to
see
if
the
model
is
presenting
as
expected,
and
you
can
just
query
that
model
and
see
if
or
query
that
record
I
mean
and
see
if
they're
still
throwing
errors
or
try
to
repair
it
using
that
we'll
leave
repair
it
to
another
time,
but
logs
logs
are
vitally
important
to
figuring
out
the
the
problem
and
if
you
just
want
to
skip
to
a
single
log
exceptions,
Json
log
is
the
best
log
to
use
I.
A
Think
anyway,
you
can
just
kind
of
skip
to
it.
It
has
all
the
Puma
in
gitlab
dash
rails
folder.
You
can
see
the
exceptions
Json
log,
because
it'll
just
show
all
the
exceptions
and
across
application
production
and
Workhorse
logs
too
I
believe
and
I
mentioned
the
Json
logs,
because
they're
somewhat
new
compared
to
version
11.
A
They've been added to and improved more and more since the previous presentation. This is really important: if we don't find the error, you just can't find the root cause; it's just very, very hard.
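A quick sketch of skimming that log (default Omnibus path; the dotted field names follow the examples in the logging docs, but verify them for the version in question, and the correlation ID is a placeholder):

    # list the most recent exceptions with their correlation IDs
    sudo jq -r '[.time, .correlation_id, ."exception.class", ."exception.message"] | @tsv' \
      /var/log/gitlab/gitlab-rails/exceptions_json.log | tail -n 20

    # or just grep for the correlation ID the user got from the 500 page
    sudo grep '<correlation-id>' /var/log/gitlab/gitlab-rails/exceptions_json.log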
A
It's
just
very,
very
hard,
so
logs
logs
logs
logs
just
keep
asking
the
customer
for
the
logs
if
they
can't
provide
the
logs,
for
example,
they're
on
a
closed
system,
ask
them
to
look
for
the
logs
and
just
know
where
the
logs
are
at.
You
can
even
Point
them
to
the
documentation,
tell
them
where
this
the
logs
exist
and
then
that
then
ask
them
to
look
in
the
logs
for
that
type
of
error.
A
Look
for
a
specific
type
of
thing,
like
a
500
error
or
use
that
correlation
ID,
that
they
got
from
the
browser
and
then
look
for
that
in
the
logs.
You
can
ask
the
customer
to
do
this.
If
they're
I've
worked
with
customers
that
just
can't
share
any
of
the
logged
information,
they
can't
do
a
gitlab
SOS.
They
can't
share
even
just
snippet
of
the
logs,
but
if
we
ask
them
to
find
the
logs,
they
can
provide
small
outputs
that
are
very
much
redacted.
B
We've
mentioned
this
already
a
few
times,
gitlab
SOS.
So
this
is
a
really
really
important
and
helpful
tool
for
us
in
support
and
it
saves
us
and
the
customers
from
a
lot
of
time
in
the
first
instance
when
investigating
a
problem,
and
we
want
to
know
as
much
about
their
environment
as
possible
and
as
quickly
as
possible.
So
this
is
a
project
and
there's
links
to
it
in
the
handbook,
and
you
can
just
go
to
this.
You
can
run
it
either
by
cloning.
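A minimal sketch of the cloning route (the project path and script name below are what I'd expect from the support toolbox project, but check the gitlabsos README for the current URL and invocation before sending it to a customer):

    # on the GitLab node, as root or with sudo
    git clone https://gitlab.com/gitlab-com/support/toolbox/gitlabsos.git
    cd gitlabsos
    sudo ./gitlabsos.rb
    # the script writes a compressed archive for the customer to attach to the ticket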
B
It'll do a listing of CPUs (how many CPUs), memory and disk space, so you can tell if the system has run out of space, for instance, which might be causing the problem. It also collects a copy of the current log of all the different GitLab services that are on the system, as well as the OS syslog or messages.
B
So
all
sorts
of
things
in
one
place,
with
one
documented
way
of
for
the
customer
to
create
the
file
and
attach
it
to
the
ticket
and
then
you're
off
to
a
good
start
to
to
try
and
get
to
the
bottom
of
whatever.
The
issue
is.
B
One
thing
to
mention
is
that
the
latest
versions
of
good
lab
SOS,
if
you
run
it
after
cloning,
the
project,
so
the
first
sort
of
way
of
running
an
app
so
is
it
will
include
a
sanitized
copy
of
the
get
their
blood
out
in
configuration
in
the
SOS,
which
is
also
really
helpful
because
that's
sort
of
the
other
thing
we
tend
to
have
to
ask
customers
for
so
we
can
see
what
their
current
settings
are,
but
just.
B
Do
that
if
the
customer
runs
it
directly
using
curl,
so
that
that
file
may
not
be
included
and
you'll
have
to
ask
for
it
separately?
B
Yeah
I
was
just
important
just
before
we
started
here
that
one
of
our
Engineers
Kenneth
was
actually
in
the
process
of
working
on
enhancement
to
get
their
bsos.
That
will
allow
you
to
specify
a
Time
range
for
the
logs
that
you
want
to
include
in
it.
So,
as
I
said
at
the
moment,
it
will
just
include
the
most
recent,
the
current
active
log
file
for
each
service
and
on
a
busy
system.
B
Those
files
can
get
rotated
very
quickly,
so
we
do
always
recommend
the
customer
reproduces,
whatever
problem
they're
having
and
then
immediately
runs,
we've
got
lab
SOS
so
that
it
will
have
those
errors
in
the
current
log,
but
often
even
if
it's
around
15
minutes
or
half
an
hour
later,
you
might
find
that
the
time
period
involved
is
not
included.
So
that
will
be
a
a
good
enhancement
when,
when
it
gets
released,.
B
When
you
get
it
back,
it's
a
it's
a
compressed
tar
file,
so
you
extract
it
to
your
to
your
local
machine
and
it
extracts
it
into
a
hierarchical,
folders
and
log
files
and
things.
And
then
you
can
visually
inspect.
B
But we also have some other projects that people have created to help with the interpretation and parsing of those SOS files.
B
So
the
two
key
ones
are
fast
bets,
which
is
specifically
for
extracting
performance
information
from
the
get
their
blog
files
and
it'll
show
you
a
number
of
operations
that
are
different
types
of
operations
that
are
performed
how
long
they
took
where
they
spent
their
time,
whether
it
was
in
database
access
or
queuing
or
CPU,
and
that
sort
of
thing
and
the
requests
per
second
involved
of
that
operation.
So
that's
a
way
of,
especially
for
performance
issues.
When
customers
say
the
system
is
running
slowly
you
can
check
and
see.
B
Is
that
one
particular
operation
is
running
very
slowly.
Are
there
hundreds
or
thousands
of
operations
requests
per
second
being
issued
for
it
for
some
for
something
which
is
just
overwhelming
the
system
in
the
information
like
that,
you
can
get
out,
and
you
can
also
compare
particular
log
files
statistics
to
the
benchmarks
for
that
version
of
gitlab
to
see
if
it's
sort
of
behaving
as
as
expected
or
not
so,
there's
lots
of
documentation
around
how
fast
Tax
Works
it
can
produce
graphs
as
well,
showing
showing
the
metrics.
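As a rough sketch of the simplest use (assuming the fast-stats binary from the support toolbox is installed; see its README for the exact subcommands, comparison options and graph flags, which I'm not reproducing from memory here):

    # summarize a Rails production log pulled out of a GitLabSOS archive
    fast-stats production_json.log

    # the tool's own help lists the available reports and flags
    fast-stats --help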
B
So
it's
a
very
powerful
Tool,
bringing
performance
related
issues
and
then
green
hat
is
another
there's
another
one
which
is
a
sort
of
user-friendly,
take
space
interface
into
a
bunch
of
options
to
pass
the
log
files
examine
the
information.
That's
in
there
print
out
the
system,
configuration
and
metrics
and
state
in
in
the
easily
read
format
and
a
whole
bunch
of
other
ones.
B
One
thing
to
remember,
though,
with
this
OS
is
it's
very:
it's
often
the
very
first
thing
we
ask
on
ticket
and
it's
a
customer's
one
of
those
very
special
excuse.
Me.
Special
ones
actually
includes
SOS
with
the
first
of
the
first
Contact
on
the
ticket,
but
often
the
very
first
thing
we'll
go
back
to
them
and
ask
for
is:
can
you
please
send
us
a
get
letter
SOS
from
your
instance
now
for
a
large
installation,
maybe
using
a
reference
architecture?
B
They
may
have
upwards
of
30
nodes
configured
as
part
of
a
you
know:
5000
user,
get
that
reference
architecture
and
even
for
the
non-reference
architectures
they
may
have
multiple
rails
nodes,
multiple.
They
will
have
potentially
multiple
giggly
nodes,
multiple
psychic
notes.
So
when
you
ask
them,
can
I
get
a
bit
SOS,
please
you
do
have
to
be
a
bear
in
mind.
You
might
be
asking
them
to
do
this
across
a
dozen
or
more
instances
at
once.
B
So, along with GitLabSOS, which is used for our Docker and Omnibus based installations, we have KubeSOS, which is used for our Helm chart based installations. It's the same idea: it's designed to be a sort of one-line tool, a command you can run to get a file together containing a whole bunch of useful information about how your Kubernetes cluster is set up, and also collecting all the GitLab logs.
B
It's
a
project
that
you
that
you
clone
and
then
run
the
run
the
command
it
does
require
you
to
run
it
from
a
machine
that
has
a
coupe
CTL
access
to
the
cluster
that
it
can
interrogate
and
get
the
required
information,
and
you
do
have
to
tell
it
which
namespace
in
your
class
to
look
at
levels
installed,
because
for
many
of
the
commands
that
runs
the
namespace
specific.
So
we've
got
neighbors
into
your
default
namespace
and
you
don't
specify
the
default.
B
The
namespace
you'll
get
a
bunch
of
information
back
that
isn't
all
that
helpful,
someone's
key
things
it
does
include,
though
it
includes
the
currently
applied
pound
chat,
values
that
have
been
used
to
configure
the
gitlab
deployment,
and
it
also
includes
the
log
files
from
all
of
the
different
services
that
run
in
pods
within
the
classes.
So
you
have
psychic
pods
you'll
have
web
service
pods.
Definitely
pod
and
you'll
get
a
log
file
produced
from
each
of
those.
B
Now,
unlike
the
yes's,
which
collect
the
individual
log
files
for
each
service
in
the
single
nicely
formatted
file,
the
kubernetes
logs
from
a
pod
will
include
logs
from
all
the
containers
running
within
their
pod
and
you'll.
B
That
means
that,
for
instance,
for
the
web
service,
part
you'll
have
Workhorse
logs
mixed
up
with
rails
type
logs
and
the
whole
thing
there's
a
little
bit
of
a
jumble
of
logs
from
different
services,
and
if
you
do
want
to
apply
those
logs
to
something
like
Fast
debts,
you
will
need
to
do
some
selective
gripping
of
the
lines
that
are
relevant
from
from
those
files
to
get
them
into
a
format.
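A rough sketch of that filtering step (the file name is a placeholder, and the field I grep for is just one that appears in Rails production JSON lines; adjust it to whatever reliably distinguishes the lines you want):

    # keep only the Rails production JSON entries from a mixed webservice pod log,
    # then feed the result to fast-stats
    grep '"controller":' webservice-pod.log > production_json_only.log
    fast-stats production_json_only.log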
B
Their
past
steps
can
can
run
against,
and
the
other
really
useful
thing
in
the
file
is
the
events
logs
from
the
cluster
software
with
kubernetes.
B
It
may
be
evicting
pods
that
are
because
the
memory
in
the
environment
is
too
too
load
and
often
you'll
get
information
about
those
from
the
actual
cluster
event
logs,
and
you
can
go
back
to
the
customer
and
say
well
actually
this
you
need
to
increase
the
memory
you
have
for
your
nodes
are
running
too
much
too
many
points
in
there,
you're
evicting
them
and
as
per
the
service,
it
is
best
to
run
this
as
soon
as
possible,
after
reproducing
whatever
the
problem
is
because
the
kubernetes
is
likes.
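If you need to spot-check the same things by hand, the namespace and pod name below are placeholders and these are standard kubectl commands rather than anything GitLab specific:

    # recent cluster events: evictions, OOM kills, failed scheduling
    kubectl get events -n gitlab --sort-by=.lastTimestamp

    # pod health: restarts and Evicted/CrashLoopBackOff statuses
    kubectl get pods -n gitlab

    # logs from every container in one pod, last hour
    kubectl logs -n gitlab <webservice-pod-name> --all-containers --since=1h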
B
Okay,
so
this
section
is
just
about
the
different
kinds
of
deployments
and
ways
you
can
deploy
good
lab
and
I
guess.
This
has
changed
a
lot
over
the
years
as
as
more
and
more
options
are
developed
and
made
available.
B
So
Geo
troubleshooting
is
sort
of
a
thing
unto
itself
in
a
way,
because
it's
not
a
Geo
is
our
multi-site
deployment
method,
which
primarily
is
about
bringing
copies
of
repositories
and
database
information
into
different
geographical
locations
that
are
closer
to
the
end
users
to
make
things
like
cloning,
and
you
know,
pushing
repos
and
things
faster
for
people,
because
they're
they're,
if
they're
another
part
of
the
world
and
they're
trying
to
access
the
lab
server
someone
far
away,
then
it
will
take
a
lot
longer.
B
But
the
other
thing
that
Geo
provides
is
a
disaster
recovery
mechanism
whereby
you
can
have
your
primary
site,
your
replicated
secondary
site
that
it's
been
used
by
people
in
that
region
and
then,
if
something
happens
to
the
primary
you
can
switch
over
to
the
secondary
and
not
have
any
loss
of
data
or
much
in
the
way
of
a
downtime.
B
So
it's
quite
a
lot
of
moving
Parts
involved
in
how
that
replication
between
the
primary
and
the
secondary
is
performed
for
all
the
different
kinds
of
objects
in
the
lab
environment.
So
you
have
your
database,
which
is
being
replicated
by
a
postgres.
B
You
have
your
repositories
which
are
being
replicated,
and
then
you
have
all
your
different
kinds
of
objects
like
uploads
or
that's
their
Snippets
and
and
things
like
that,
which
need
to
be
transferred
from
one
site
to
the
other
or
you
know
as
soon
as
possible
after
they've
changed.
B
Now, I'll just mention one tip in the troubleshooting for Geo, which is to reset the secondary site, which performs a full resync of all the data from the primary to the secondary. I think that's an option that has certainly been used to fix problems.
B
Think
because
Geo
is
such
a
Dynamic,
yes,
it's
being
updated
all
the
time
bugs
have
been
fixed
and
new
features
are
being
deployed
that
possibly
there
are
a
lot
more
cases
with
it
that
is
required,
what's
required
in
the
past,
as
a
as
the
last
resort
to
to
get
things
working
again,
I'd
say
that
these
days,
there's
possibly
less
less
necessary
to
go
there
and
also,
if
you
are
going
to
suggest
it,
just
bear
in
mind
that
it's
sort
of
a
you
know
nuclear
option
in
terms
of
you're
going
to
knock
out
the
Dr
side
and
it
may
take
hours
or
days
to
get
the
sinking
back
in
sync
again.
B
So
it's
not
something
the
customer
may
be
there,
keen
on
doing
so,
just
a
bit
sensitive
around
that
when
you're
suggesting
it
and
explore
other
options.
First
and
all
sorts
of
things
can
come
into
play
when
troubleshooting
Geo
issues,
apart
from
problems
with
gitlab
itself,
there's
a
lot
of
performance
aspects
that
can
cause
things
to
get
out
of
sync
and
backlogs
to
develop.
So
you
really
have
to
be
looking
at
the
Italy
prefix,
Network
and
database
performance
at
both
sites.
B
Potentially,
if
there's
a
problem
with
replication,
just
not
happening
happening
as
quickly
as
it
as
a
customer
wants
or
two
or
as
it
should
be
yeah,
so
reference
architectures.
So
there's
a
lot
of
work
again
in
this
area
has
happened
in
recent
years,
so
we
have
our
reference
architectures
that
we
recommend
to
customers
as
a
reliable
tested
and
benchmarked
way
to
provision
a
gitlab
environment
for
a
particular
user
cap
based
on
certain
assumptions
about
what
typical
users
do.
B
So
we
have
reference
architectures
going
from
500
or
1000
users
up
to
50
000
users,
and
you
can
see
all
this
fix
for
those
in
terms
of
machine
types
and
numbers
and
architectures
and
the
documentation.
B
The
reference
architectures
provide
High
availability
now
just
put
a
star
next
to
that.
Just
to
remind
me
to
mention
that
there
is
some
caveats
around
that
that
are
mentioned
in
the
docs
and
one
of
those
things
is
around
prefect
database.
B
It's
not
it's
not
h
a
and
the
reference
architectures
unless
you
post
it
externally
on
a
on
a
database
database
platform
that
that
is
highly
available,
but
otherwise,
nearly
as
far
as
I
know.
All
the
other
feature
parts
are
good.
There
can
be
provisioned
in
a
distributed
way
to
make
them
highly
available.
B
So if you're troubleshooting an issue, you might be tempted to say to a customer: oh, you know, you're having performance problems, you should deploy a reference architecture, have a look at the 3,000-user one. And that one says you need to deploy 28 or 31 nodes or something, and I had a customer who was rightfully upset about that suggestion, because they just didn't have the workload that's associated with a 3,000-user system, but they wanted high availability in their environment.
B
So
you
can
reduce
that
down,
but
the
risk
you
take.
There
is
just
that
the
performance
won't
be
as
good
as
it
is
guaranteed
to
be
by
the
reference
architectures.
B
Sort
of
stateless
parts
of
gitlab,
and
then
we
use
on
the
bus
deployments
of
and
Prospect
to
store
a
home
repository
data
and
that
also
leverages
object.
Storage
to
store
the
other
information
which
external
and
sometimes
when
it
cut
again
when
it
comes
down
to
Performance
and
get
LED
having
more
nodes
can
be
available
alternative
to
just
having
a
single
larger
nodes
and
it
might
even
cost
customers
less.
B
So
we
do
have
to
work
through
what
the
customer's
actual
environment
and
requirements
are
and
if
they're
having
performance
issues,
then
these
are
all
different
options
that
can
be
suggested
and
hopefully
link
to
in
our
doc.
So
the
customer
can
do
their
own
research
and
decide
which
ones
are
most
appropriate.
Events.
B
Oh
so
yeah
so
probably
mentioned
some
of
this
already
one.
B
So
one
thing
to
do
with
with
large
environments
and
and
things
like
reference
architectures
is
to
don't
just
assume
there
might
be
a
single
instance
good
lab
so
remember
to
find
out
before
you
start
suggesting
things
to
do
for
their
to
address
their
issue.
You
can
find
this
out
by
asking
them.
B
You
can
also
have
a
look
at
prior
tickets,
because
often
they
will
have
had
the
same
questions
asked
for
them
in
the
past
by
other
support
Engineers
on
other
tickets
and
animations
there,
and
some
of
our
customers
actually
have
architecture
issues
linked
to
from
the
help
desk.
So
you
can
look
at
those
and
see
a
hopefully
recent
architecture,
diagram
and
other
information
about
them,
as
I
mentioned
before,
be
selective
when
requesting
your
services
in
large
environments,
because
that
can
be
a
lot
I
guess.
So
this
is
so.
B
If
you
only
need
to
see
sidekick
logs,
then
just
request
your
services
from
Psychic
nodes,
and
you
can
even
reduce
that
down
to
ask
for
just
the
log
files
themselves
if
you're
reasonably
searching
about
what
it
is.
You
want
to
want
to
check
and
be
aware
that
they
may
be.
The
environment
may
be
using
external
external
postgres
and
object
storage
or
what
they
may
be
using
those
Services
as
they
are
deployed,
I
get
led.
B
The
South
has
per
the
reference
architectures
options
to
say
a
whole
bunch
of
possibilities
there
and
one
thing
to
bear
in
mind.
This
applies
to
to
any
gitlab
installation
in
the
cloud.
Not
just
reference
architectures,
but
for
performance
issues.
Again,
do
be
aware
of
the
potential
for
a
mismatch
between
instance,
types
and
storage
Types
on
there
Cloud
compute
instances.
B
So
troubleshooting,
Cloud
native
or
which
is
how
we
refer
to
kubernetes
deployments
of
gitlab
we've
talked
about
kubi,
so
this
and
I'm
just
seeing
if
there's
anything
much
there
I've
been
mentioned.
B
Can
rotate
quickly
be
aware
of
that?
We
have
two
files
that
get
included
in
the
Cooper.
So
it's
that's
it
sometimes,
because
sometimes
it
seems
that
information.
Isn't
there
I'm
not
here,
to
go
back
and
ask
the
customer
directly
for
it,
but
that's
extremely
helpful,
especially
the
user
supplied
values
to
see
exactly
what
configuration
values
have
been
applied,
make
sure
your
objective
event
logs
as
well
in
case
the
issue
is
not
collab
at
all,
but
it's
being
imposed
on
it
by
the
cluster
itself,
due
to
Resource
limitations
or
other
areas.
B
Network
errors,
DNA
series
that
sort
of
thing
an
ability
to
pull
down
images
from
external
places
that
and
other
things
external
to
get
there
and
speaking
for
myself
chat
values
for
kubernetes
deployments
can
be
confusing.
B
I,
often
struggle
to
know
exactly
where
they
should
be
specified
and
and
what
you
know
what
sub
subheadings
should
be
associated
with,
and
so
I
make
good
use
of
a
test
test
cluster
that
I
have
to
try
things
out
and
make
sure
I'm
not
going
to
tell
the
customer
something
and
it
is
actually
incorrect
or
misformatted
foreign.
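Two cheap checks I lean on when a test cluster isn't handy (release name, namespace and values file are placeholders; helm get values shows what the customer actually has applied, and helm template renders the chart locally, which catches malformed YAML and some misplaced keys, though a real cluster remains the better test):

    # what user-supplied values are currently applied to the release
    helm get values gitlab -n gitlab
    helm get values gitlab -n gitlab --all    # include computed defaults too

    # render the chart locally against proposed values without touching any cluster
    helm repo add gitlab https://charts.gitlab.io/
    helm repo update
    helm template gitlab gitlab/gitlab -f proposed-values.yaml > /dev/null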
B
.Com
or
SAS
troubleshooting,
so
a
lot
of
troubleshooting
of
issues
reported
by
SAS
customers
is
similar
to
what
we
do
for
customers
who
manage
their
own
good
lab
environments.
B
But
some
of
the
processes
are
also
different
and
the
one
key
thing
we'll
keep
things
that
I
might
have
to
remind
myself
constantly
to
remember
to
do
this
when
I
am
dealing
with
the
ticket,
especially
if
it's
for
something
that
sounds
like
it.
B
You
know
a
key
part
of
gitlab.com
is
not
working
properly
and
it's
something
that
isn't
specific
to
anything.
A
particular
customer
has
has
configured
is
to
check
whether
it's
a
known
issue
already,
because
there's
so
many
custom
people
using
github.com
chances
are
anything
major
will
have
already
been
reported
and
logged
by
someone
else.
B
You
can
save
yourself
a
lot
of
time
by
just
checking
in
the
slack
with
a
Incident
Management
Channel,
whether
the
incident
has
been
declared
relating
to
a
particular
problem.
You
can
check
our
state
of
stock
atlib.com
page
for
similar
information
and
there's
also
issue
tracker
called
reliability,
engineering
team
that
records
slower
priority
issues
or
longer
running
knowing
issues,
and
the
other
key
thing
is
to
check
for
recent
recent
similar
tickets
and
that
can
save
a
lot
of
time.
When
it's
been
I've
been
out.
You
know
now
trying
to
figure
something
out.
B
If you do need to go hunting down a particular problem, we can't just log on and look at the logs, and we can't get an SOS run, so we have tools to let us do those things instead. Kibana is the tool for searching all the log files from all the different gitlab.com instances and components, against an Elasticsearch backend.
B
You
do
need
to
remember
when
you
go
into
that
to
choose
which
log
Source
you're
interested
in
whether
it's
get
to
Lee
or
sidekick
or
collect
rails
and,
as
Matthew
mentioned
earlier,
having
the
correlation
ID.
What's
your
customity
from
the
error
page
and
bitlab,
when
it
appears,
it's
really
helpful
to
crack
things
down,
and
there
is
a
correlation
dashboard
available
in
Cabana
that
it's
huge
again,
the
correlation,
ID
internet
searches
across
multiple
log
sources
that
you
need
messages
with
that
correlation
ID
in
them,
which
can
be
a
great
time.
B
Saver,
bear
in
mind,
there's
a
seven
day
retention
of
those
logs,
so
you
need
to
for
an
issue
the
customers
reporting
happen
more
than
seven
days
ago.
You
need
to
get
it
reproduced
to
try
and
hunt
it
down.
B
Century
is
the
other
talk,
so
Century
will
actually
collect
similar
errors
into
issues
and
that's
a
good
tool
for
seeing
as
a
particular
type
of
error
happening
a
lot
over
a
given
time
period
across
lots
of
customs
and,
if
you
do
think,
you've
identified
an
issue
that
hasn't
been
reported
before
then.
There's
processes
in
the
handbook
for
using
Century
to
create
an
issue
for
the
site,
reliability
teams
to
look
into
and
see
if
the
action
needs
to
be
taken
to
fix
that.
A
Anything
Stevens,
so
the
presentation,
hopefully
you've
learned
something.
This
is
what
we
normally
do
in
day
to
day
at
gitlab
to
troubleshoot
issues
and
troubleshoot
problems
for
customers
and
ourselves
caleb.com,
so
I'm
gonna
close
it
here
and
stop
the
recording,
take
care.