From YouTube: GitLab 15.3 Kickoff - Enablement:Memory
Description
Kickoff for the Memory Group for the GitLab 15.3 release
Planning issue: https://gitlab.com/gitlab-org/memory-team/team-tasks/-/issues/118
Memory Group Past Kickoff Videos: https://youtube.com/playlist?list=PL05JrBw4t0Kq1HDOIfQ8ov6lfyJkWK2Yr
Presentation by: Yannis Roussos, Sr. Product Manager, Memory and Database Groups
So our top priorities are more or less the same as the ones we had for the past few milestones, but we have seen traction on multiple fronts, so I have multiple new, exciting things that we are going to discuss. The first top priority is investigating Puma's long-term memory use. Puma is our web server, and we have observed that the memory of the service kept increasing during periods when we did not have any deployments.
The reason for that is that we cannot continue with such a deep dive by only using, for example, Prometheus metrics or the logs that we store in Kibana. We have to run diagnostic reports like, for example, jemalloc stats, Ruby heap dumps, and so on.
A problem there is that, in order to do so, we don't have access to the production servers, so every time we have to ask an SRE to help us. So we want to build a way to generate and collect those diagnostic reports. In the first iteration, we will focus on producing those reports on the running instance.
We will worry about collection in a later iteration, so in this first iteration the collection of the reports will still require an SRE to help us fetch and analyze them. The interesting part here is that we have to assume that most of those reports and diagnostic tools incur a significant cost and may interfere with serving user traffic, so they may interfere with the availability of our nodes, which is the most important thing. So we have to make sure that these reports are generated and collected in a way that minimizes the impact on node availability.
Our second effort on this front is identifying the usefulness of the diagnostic reports, especially the ones that require SSH access to the nodes. So, for example, generating various different reports, like the Ruby heap dumps, the process memory maps, and so on, and then running a few sessions during 15.3 to determine which ones are the ones we really need in order to be able to do our analysis and our investigations.
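As a point of reference, the Ruby heap dumps mentioned here can be produced with the standard objspace library; the sketch below is illustrative only, and the output path and trigger mechanism are assumptions rather than GitLab's actual diagnostic tooling:

```ruby
# Minimal sketch of producing a Ruby heap dump from a running process.
# The path and trigger are illustrative, not GitLab's diagnostic tooling.
require "objspace"

# Optionally record allocation sites so the dump includes file/line info.
ObjectSpace.trace_object_allocations_start

def write_heap_dump(path = "/tmp/heap-#{Process.pid}.json")
  File.open(path, "w") do |file|
    # Writes one JSON document per live object on the Ruby heap.
    ObjectSpace.dump_all(output: file)
  end
  path
end

write_heap_dump
```

Dumps like this are exactly the kind of report that is expensive to produce, which is why generating them without impacting node availability matters.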
The next one, which is related to the previous ones as well, and to memory usage in general, is tuning the jemalloc settings for GitLab.com. jemalloc is a variant of the malloc library; it is the way our code allocates memory. The problem here is that we have never managed to tune jemalloc for GitLab, for GitLab.com.
But now, first of all, we have the jemalloc stats and, secondly, by the time we have finished this effort, we will also be able to collect them. So we are planning to use those jemalloc stats to fine-tune the settings of jemalloc on GitLab.com.
The thing there is that, because it is not tuned, it can result in ever-growing memory usage for GitLab.com.
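For illustration, jemalloc is normally tuned through the MALLOC_CONF environment variable that the allocator reads at process start; the option names below are real jemalloc knobs, but the values are placeholders rather than the settings GitLab will end up choosing:

```ruby
# Illustrative sketch only: jemalloc tuning options are passed via MALLOC_CONF.
# Option names are real jemalloc settings; the values are placeholders.
TUNING = {
  "background_thread" => "true", # purge unused pages on background threads
  "narenas"           => "2",    # limit the number of allocation arenas
  "dirty_decay_ms"    => "1000", # return dirty pages to the OS sooner
  "stats_print"       => "true"  # print allocator statistics when the process exits
}.map { |key, value| "#{key}:#{value}" }.join(",")

# The variable has to be present in the environment that launches Puma/Sidekiq;
# setting it from inside an already-running Ruby process is too late.
puts %(MALLOC_CONF="#{TUNING}" bundle exec puma -C config/puma.rb)
```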
The next one, an effort that has come out of the investigation into Puma's long-term memory usage, is considering replacing the Puma Worker Killer. Initially, we figured out that the worker killer did not work correctly; that is why the memory kept increasing. So, the main task of the Puma Worker Killer is this:
It is a process that monitors the memory that the other processes are using and, when it passes a certain threshold, it kills them, it reaps them. It also occasionally reaps workers based on a timer so that they are refreshed. The problem there is that, first of all, it uses RSS, which is not a good measure of real memory use.
So our approach here, our idea, is to add a new memory watchdog for Puma. The idea is that, instead of having a static memory limit, we use heap utilization instead: how efficiently the workers are using the heap, are using the memory. Our thinking is that high memory use is not a bad thing as long as the memory is used efficiently.
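As a rough illustration of what heap utilization could mean here (a sketch of the general idea, not the actual watchdog logic, and with an arbitrary threshold), the Ruby GC exposes counters from which a live-slots-to-available-slots ratio can be derived:

```ruby
# Sketch: derive a "heap utilization" ratio from Ruby GC statistics.
# The formula and the 0.5 threshold are illustrative, not GitLab heuristics.
def heap_utilization
  stats = GC.stat
  # Fraction of allocated heap slots currently occupied by live objects.
  stats[:heap_live_slots].to_f / stats[:heap_available_slots]
end

# High memory use with high utilization means the worker is using what it asked
# for; low utilization points at fragmentation or bloat worth reaping.
warn "low heap utilization: #{heap_utilization.round(2)}" if heap_utilization < 0.5
```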
Going on to our next top-priority effort: optimizing the Sidekiq workers that consume a lot of memory and can cause out-of-memory kills. There is a theme here. This is work we have been doing for a few months; in the past, we figured out that there are workers that use too much memory. For example, some of those may use more than a gigabyte of memory, or even go up to six gigabytes of memory.
Such a worker may not even have the time to log, to add a log entry saying "I am being killed because I used too much memory." Or there are other issues: for example, on a node where you have multiple workers running, one worker consumes too much memory and then the whole server, the whole node, is killed, and you see six worker kills and you are not sure which one was the one causing the problem.
That is something we can dive into, but it causes a lot of noise for us. So the idea here is: can we use different, other ways to monitor those problems? One idea is to use the Sidekiq memory killer. This is similar to the Puma Worker Killer. Currently, it does not run on all types of nodes on GitLab.com.
It runs only on a few types of workers, and the idea here is to enable it everywhere and, instead of allowing it to kill the processes, to use it as an early warning: to log that there is a problem before the Linux out-of-memory killer gets in and kills the process, so that we have more details, more data. There are two things here.
So maybe we don't even need to use the Sidekiq memory killer on that front, so we will first work on this. But the second thing that we also want to do before moving forward is to identify why we see the Sidekiq memory killer not being triggered before pods are killed by the Linux out-of-memory killer. We want to identify that and understand what happens with our memory allocation and what other triggers can cause pods to be killed.
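To make the early-warning idea concrete, here is a minimal sketch of the mechanism (this is not the actual Sidekiq memory killer; the threshold, interval, and log message are made up for illustration):

```ruby
# Sketch of an early-warning monitor: when RSS crosses a threshold it only
# logs, so the event is recorded before the Linux OOM killer steps in.
RSS_WARN_BYTES = 2 * 1024**3 # 2 GiB, arbitrary for this sketch
CHECK_INTERVAL = 15          # seconds between checks

def current_rss_bytes
  # VmRSS is reported in kB in /proc/<pid>/status (Linux only).
  File.read("/proc/#{Process.pid}/status")[/^VmRSS:\s+(\d+)\s+kB/, 1].to_i * 1024
end

Thread.new do
  loop do
    rss = current_rss_bytes
    warn "memory early warning: RSS=#{rss / 1024 / 1024} MiB" if rss > RSS_WARN_BYTES
    sleep CHECK_INTERVAL
  end
end
```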
The second way of addressing this is to use the GitLab Sidekiq reliable fetcher feature. This is an add-on on Sidekiq that allows us to track the interruption count, to track how many times workers have been interrupted, like, for example, when the Linux out-of-memory killer gets in, interrupts the process, and kills the pod. So we will investigate this as well.
Finally, our last top priority is creating the custom SLIs for Global Search. This is to increase the visibility into how our Global Search works. We are deep into the implementation and expect to finish it by the end of 15.3, which will allow the Global Search team to have much better visibility into the different types of searches, basic or advanced, and into the different types of scopes, with separate metrics.
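For readers unfamiliar with custom SLIs, here is a generic sketch of the pattern using the prometheus-client gem (this is not GitLab's SLI framework; the metric names, labels, and success condition are made up for illustration): count total and successful searches per type and scope, and derive the indicator as success over total.

```ruby
# Generic SLI sketch with prometheus-client: a total counter and a success
# counter per search type and scope. Names and labels are illustrative only.
require "prometheus/client"

registry = Prometheus::Client.registry

search_total = registry.counter(
  :global_search_sli_total,
  docstring: "Total global search requests",
  labels: [:search_type, :scope]
)
search_success = registry.counter(
  :global_search_sli_success_total,
  docstring: "Global search requests that met the target condition",
  labels: [:search_type, :scope]
)

# Recording one observation:
labels = { search_type: "advanced", scope: "issues" }
search_total.increment(labels: labels)
search_success.increment(labels: labels) # only when the request met the target

# The SLI itself is computed at query time, for example in PromQL:
#   sum(rate(global_search_sli_success_total[5m]))
#     / sum(rate(global_search_sli_total[5m]))
```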
That's it. Thank you for watching, and talk to you next month.