From YouTube: GitLab 15.2 Kickoff - Enablement:Memory
Description
Kickoff for the Memory Group for the GitLab 15.2 release
Planning issue: https://gitlab.com/gitlab-org/memory-team/team-tasks/-/issues/117
Memory Group Past Kickoff Videos: https://youtube.com/playlist?list=PL05JrBw4t0Kq1HDOIfQ8ov6lfyJkWK2Yr
Presentation by: Yannis Roussos, Sr. Product Manager, Memory and Database Groups
Puma is the web server that we are using in GitLab, and we have observed that the memory of the Puma servers, the Puma pods, keeps on growing when they are not restarted. This is most evident during weekends, when we don't do deployments, because during deployments the pods are of course restarted and the memory is cleared. But during weekends you can see it in the blue line: the memory keeps on increasing, in contrast to the yellow line.
So this is a clear indication that we have a potential runaway memory issue. In 15.1 we investigated this problem, and our investigation generated multiple findings and led us to three separate paths that we want to keep working on during 15.2. The first one is, of course, the reason why we started this initiative: we want to find the origin of the growing memory use of Puma when the pods are not restarted, and fix it.
The second path is that we want to add ways to gather more data from production servers, increasing our visibility into those problems and allowing us to diagnose these and similar issues. The main thing here is to add Ruby heap fragmentation metrics: as I already said, we have found that the increase in Puma memory is primarily due to Ruby heap fragmentation.
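To make that concrete, here is a minimal sketch of how a fragmentation ratio can be derived from GC.stat in plain CRuby. This is only an illustration of the idea, not GitLab's actual metric; the formula is one common approximation.

```ruby
# Sketch: approximate Ruby heap fragmentation from GC.stat.
# A ratio near 0 means eden pages are densely packed with live objects;
# a ratio near 1 means many pages are mostly empty (fragmented).
SLOTS_PER_PAGE = GC::INTERNAL_CONSTANTS[:HEAP_PAGE_OBJ_LIMIT]

def heap_fragmentation
  stat        = GC.stat
  live_slots  = stat[:heap_live_slots]
  total_slots = stat[:heap_eden_pages] * SLOTS_PER_PAGE
  1.0 - (live_slots.to_f / total_slots)
end

puts format('heap fragmentation: %.1f%%', heap_fragmentation * 100)
```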
The third path that we want to work on is to decide how to deal with resource allocation in various environments. In GitLab we have what we call the Puma Worker Killer. This is a piece of Ruby code that runs in the background as a thread and reaps Puma worker processes if they run over a given memory threshold.
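For context on what such a killer looks like, here is a minimal sketch using the puma_worker_killer gem; the numbers are illustrative, not GitLab's production configuration.

```ruby
# config/puma.rb (or an initializer) — illustrative values only.
require 'puma_worker_killer'

PumaWorkerKiller.config do |config|
  config.ram           = 4096 # total memory budget for the Puma cluster, in MB
  config.frequency     = 20   # seconds between memory checks
  config.percent_usage = 0.98 # reap the largest worker once usage exceeds 98%
end
PumaWorkerKiller.start # runs in a background thread, as described above
```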
While we were running this investigation, first of all we realized that this was not running correctly on GitLab.com due to a configuration issue, so we fixed that. But while doing so, we started a larger discussion: does it even make sense to run this type of killer in a resource-controlled environment like Kubernetes? In Kubernetes we define container and other resource limits anyway, so should we turn it off and allow Kubernetes to do its job? This is true for the Puma Worker Killer, and we had a similar discussion about the Sidekiq memory killer.
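To illustrate the trade-off, here is a hypothetical helper (not GitLab code) that checks whether the process already runs under a cgroup memory limit, which is the case inside Kubernetes when container limits are set; if it does, an in-process killer may be redundant.

```ruby
# Hypothetical check: read the cgroup memory limit (v2 first, then v1).
# Returns nil when no limit applies or the files are absent.
def cgroup_memory_limit_bytes
  v2 = '/sys/fs/cgroup/memory.max'
  v1 = '/sys/fs/cgroup/memory/memory.limit_in_bytes'
  raw = File.read(File.exist?(v2) ? v2 : v1).strip
  raw == 'max' ? nil : Integer(raw) # cgroup v1 reports a huge number, not 'max'
rescue Errno::ENOENT
  nil
end

if (limit = cgroup_memory_limit_bytes)
  puts "runtime enforces a memory limit: #{limit / 1024 / 1024} MiB"
else
  puts 'no cgroup memory limit detected'
end
```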
Finally, we want to make a decision on how to set resource limits in all the other environments, for example on Omnibus. For example, we have a maximum amount of memory that each Puma worker can use, and most of the time those limits are hardcoded: they have a default, we set it, and that's it. We set those limits based on our reference architectures, which try to cover most cases. But not all environments are the same, so sometimes those limits can be too low for the environment that each GitLab instance runs in.
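On Omnibus, for example, that per-worker limit is a setting in /etc/gitlab/gitlab.rb. A sketch of overriding the hardcoded default for a larger environment (the value is illustrative):

```ruby
# /etc/gitlab/gitlab.rb — raise the memory threshold at which a Puma worker
# is considered too large; apply with `gitlab-ctl reconfigure`.
puma['per_worker_max_memory_mb'] = 1600
```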
The second priority is to support the effort for FIPS compliance in GitLab. This is an initiative that we have been running throughout the whole of GitLab for a few months now, and the core requirement there is that all communications should be secure. During 15.0 and 15.1, both the Memory group and other groups addressed a lot of those cases. The last thing that remains for us is to add TLS support for the dedicated metrics servers.
Those are the metrics endpoints that are scraped by Prometheus to collect all the metrics, so we need to add support for TLS there. There are two types of metrics exporters. There are the ones that live inside the Rails application, the GitLab Rails application; those run inside Puma, and we have covered them by enabling TLS support in Puma in general. And then there are the dedicated server endpoints for Puma and Sidekiq, and this is the last main thing that we're working on, where we also want to enable TLS.
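Purely as an illustration of what a TLS-secured metrics endpoint amounts to — this toy server is not GitLab's exporter, and the certificate paths are made up — Prometheus simply scrapes /metrics over HTTPS:

```ruby
# Toy HTTPS /metrics endpoint using WEBrick (illustrative only).
require 'webrick'
require 'webrick/https'
require 'openssl'

cert = OpenSSL::X509::Certificate.new(File.read('/etc/gitlab/ssl/metrics.crt'))
key  = OpenSSL::PKey::RSA.new(File.read('/etc/gitlab/ssl/metrics.key'))

server = WEBrick::HTTPServer.new(
  Port: 8083,
  SSLEnable: true,
  SSLCertificate: cert,
  SSLPrivateKey: key
)

# Expose one sample gauge in the Prometheus text format.
server.mount_proc '/metrics' do |_req, res|
  res['Content-Type'] = 'text/plain; version=0.0.4'
  res.body = "ruby_heap_live_slots #{GC.stat[:heap_live_slots]}\n"
end

trap('INT') { server.shutdown }
server.start
```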
So if I search for memory here, I can search over epics and code and issues and merge requests. I can check, for example, references to memory all over the place, inside all of GitLab or within the GitLab organization, or I can go to merge requests, and so on. The problem here is that at the moment we gather those metrics in aggregate, so all those types of searches are accounted for as one type of metric.
We want to increase our visibility. First, we want to differentiate between those types of searches: gather different metrics inside the application, expose them as different Prometheus metrics, and finally build SLIs and SLOs for those metrics. The idea is that we're going to differentiate between basic and advanced search, because they behave completely differently: advanced search is a Premium feature, and if it is enabled the queries go to Elasticsearch instead of PostgreSQL, and those are completely different.
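A sketch of the general approach with the prometheus-client gem (assuming a recent version of the gem; the metric and label names are hypothetical, not the ones GitLab will ship):

```ruby
# Sketch: one histogram, with labels that split searches by type and scope,
# so basic and advanced searches become separate Prometheus series.
require 'prometheus/client'

registry = Prometheus::Client.registry

search_duration = Prometheus::Client::Histogram.new(
  :search_duration_seconds,
  docstring: 'Search request duration, split by search type',
  labels: [:search_type, :search_scope]
)
registry.register(search_duration)

# An advanced (Elasticsearch-backed) code search:
search_duration.observe(0.42, labels: { search_type: 'advanced', search_scope: 'code' })

# A basic (PostgreSQL/Git-backed) issues search:
search_duration.observe(0.08, labels: { search_type: 'basic', search_scope: 'issues' })
```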
With basic search, code and commit searches instead go through Git itself. So in 15.2 we will continue our work on adding those custom SLIs and SLOs.
And our final top priority is to revisit our work on optimizing workers that consume a lot of memory and cause out-of-memory kills. In the past we have investigated some issues and found that there are a few workers that consume a lot of memory: some of them more than one gigabyte of memory, and even two or three of them more than five gigabytes of memory.
And that is in some cases from one worker, a single job run for that specific worker. In general, we don't want any worker to go above 100 or 200 megabytes, so all those workers are problematic. In the past we have addressed the most memory-hungry ones. The fixes varied: for example, we had some issues with the parsing of coverage reports that was taking too much memory, and we reduced that by more than 80 percent. Or, while sending email notifications, generating the email from the templates would consume a lot of memory if the notification had tens or hundreds of comments reported at the same time. And we have fixed other problems with N+1 queries in some controllers.
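For a feel of what such a fix looks like, here is a sketch in the spirit of the coverage-report change: streaming the XML with Nokogiri's pull parser instead of loading the whole document, so memory use stays flat however large the report is. The file and element names are hypothetical.

```ruby
# Sketch: stream a coverage report instead of parsing it into a full DOM.
require 'nokogiri'

def each_covered_line(io)
  Nokogiri::XML::Reader(io).each do |node|
    next unless node.name == 'line' &&
                node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT

    yield node.attribute('number').to_i, node.attribute('hits').to_i
  end
end

File.open('coverage.xml') do |f|
  each_covered_line(f) { |line, hits| puts "line #{line}: #{hits} hits" }
end
```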
We want to go back, investigate other similar workers, and optimize them. So that's it for 15.2.