GitLab Memory Team, 19 Nov 2020

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: 2020 11 19 Memory Team 2GB Sync

Description

0:20 Looked into Puma 5.0 Experimental features
- nakayoshi_fork https://gitlab.com/gitlab-org/memory-team/memory-team-2gb-week/-/issues/6#note_450184965
- Fork_worker
11:52 Outcome - single mode is better
- Trade-offs
15:00 Puma 5.0 discussions
18:00 Reforking discussions
- Is there a way to have the refork benefit without having to refork
21:42 Looked at 20RPS in 2GB
28:00 Deduplication
36:20 Heap usage of Puma process
- Heapy
39:15 pmap Puma single mode
41:15 Understand what calls malloc and how much memory allocated
- Looked at jemalloc and jeprof
47:00 Benchmark settings discussion
- The tested settings are super aggressive
The settings still need to be tuned

A

There you go, who wants to start.

B

A

C

B

I could I could give a quick update, so let me share my screen, find the doc so yeah like yeah okay, so I looked in puma 5-0 experimental features suggested by camille, so there were two of them applicable to our like memory situation, memory story and first was nakayashi fork.

B

The idea is, I think, camille doing similar stuff without nakayashi for create. The idea is to run gc couple of times before working into worker, on the master inst on the master process and then doing compact and in the end like uh it was giving me some advancement and memory like I was looking into 90 megabytes, sometimes less, depending on like experiment I was running to and it's still it's still worse than single mod.

B

So, like my my outcome out of this experimental feature that single mod is probably the way we will will be going to because like, but what what is good about nakayasha fork. It seems like safe enough sorry, matthias.

D

Okay, we are recording right.

B

D

B

D

A

Good yeah, so by the way alexa, um I I did the same thing right, you're aware of this right just so because we said we want to make sure we're not overlapping our work.

B

No, but I not not aware what do you mean you did the same thing.

A

Okay, well, we talked about. Yesterday I mean I looked into compaction right for for a whole day,.

B

Yes, yes, I did the same thing like.

A

I looked into um like uh just before we fork into puma workers how what what does it buy us if we run a compaction, includes two major.

B

A

Cycles, actually so, yes, you're, seeing and compacting right now is the same thing.

B

ah You're talking about this, yes, I know this, I I wasn't doing like any deep research on the casual fork. I just enable it measure it and, as I mentioned like you and camille, you are doing the same, but let's say manually, but this one is included into like setting of puma.

A

So like it's like a special version of.

B

Yeah yeah, no, no, not special version, just a setting you enable it set nakayashi for true and on puma five zero. It starts doing the thing, so it gives it it gives an advancement like expected, but worse than single, I also looked into fork worker. The idea is to not use use master process in traditional way, but work from a first booted worker and then maybe refork from zero worker into another workers.

B

So it also was like minor, minor difference in my experiments. I was running it. I don't warm up gpt. It was like the difference comparable to amount of uh uss memory of the process like less than 100 megabytes. So I didn't really look into hip because I wasn't able to get to it yet so my thoughts after it that.

B

I think we we will stay with single mode because, like.

A

B

A

A question about the reforking. So what what is the idea here is the idea that you warm up the worker with some requests and then fork because, like if you refork straight away, then I don't see what the point.

B

No, no, no, no first, you report straight away, but then, like after running, like let's, let's say, hitting 1000 requests on zero worker, you will kill every other worker except zero working and reporting. So you're doing this on a general basis like after some some amount of requests.

A

I see so so you.

B

Keep reporting well the first batch.

A

Of workers is forked from master, but then any other worker that will fork in the future is forked from an existing worker. No.

B

Kinda but they call it a bit different. You see there is this picture, so you basically load first worker and working zero worker and then working from it, the first worker and then these are most heavy workers. So you are not using master in traditional way, you kind of starting to work from zero worker, and then you keep killing all other.

A

Workers, after what is this, what is the zero worker? What's special about the zero marker.

B

It's basically a master, it's basically.

A

Yeah, so what's the difference then to just from master, I.

B

Don't know they call it zero workers, they they don't use. Let's say master in this game.

B

A

Okay, but I mean there must be a reason why it's suggested to do that right. So what is the selling point here.

B

um I think I think, basically use one less process. The selling point how I.

A

See it doesn't it, you know what process, because you have master and the zero worker now.

B

Yeah but you're not you're, not loading, every single must right. I think I believe- maybe I don't understand this, but I believe master is just for orchestration here. It doesn't uh like doing the same as mastering. Let's say typical master plus worker setup, like in typical master plus worker setup master, is loading whole application right. Then we fork into workers like, but here master is very lightweight. So while I was observing it was 40 megabyte stocks. So all the preloading happened in zero worker, so master wasn't doing anything. It was just orchestrating.

A

B

So zero worker was our kinder master and then report from it. But after some time we could kill our other workers and again reform from zero worker. So there is obvious gain. We don't have heavy mastery anymore and but, as I said,.

A

Oh, I see actually sorry like it says it down here. What the benefit is, and I think that's why it doesn't benefit us is because it says here this command. Yes, yes, improve memory. If you don't fully print pre-initialize, which is what we do.

B

We already just start very heavy. We already start very happy very fast, so I I didn't see.

A

So I suspect in this setting and this when you enable this config, it probably does not pre-load the app. That's why your master is so.

B

No, no, it doesn't reloads up yeah.

A

B

A

B

E

B

Okay, like like this master, does not peel off; no, no, it doesn't have pillows up it has. Let me check it.

A

I mean if it would, then there would be no.

B

No difference no.

A

No, it doesn't reload.

B

Yeah, as you could see it, basically no it preloads up but into zero worker as far as the understand. So um let me check similar yeah. So what I see what I see even before teaching the application with the first request, unless somebody else is hitting the application. His first request on the background, I saw that zero worker is become quite heavy.

A

Yeah, I mean so I think I understand now so what they're saying is, if so, what we do currently in the traditional approach is we preload the app on master, but the master.

B

Never serves any traffic.

A

Right so the master is basically idle outside of.

B

A

So when you incorporate new or like right, so if you're starting a new worker, you always fork from a fresh from like a clean slate right. You always for a new worker in this model uh will have never have pages loaded uh that you might have to load to actually service a particular request right, whereas in this model uh the master.

B

Does not pre-load the.

A

App so it remains super lightweight and it still only does the orchestration work, but this zero worker is an ordinary worker right. It serves requests so.

B

Yeah yeah, it could have more information that we could.

A

Yes, you will have, it will be more representative of what the.

D

A

Will do, but um it will be much less useful for those applications that, like preload the whole app anyway right, because.

C

Yeah, it's actually.

B

A

I think we never really. I don't know camilo. If you ever looked at this in detail because pre-loading the app I don't actually know what exactly that entails, because clearly it does not.

A

um It does not mean that everything that could be in memory up front is actually in memory up front right, because we can see that by just if I go ahead and like just after forking, if I, if the first worker or the first batch of workers goes live, and I send a single page request right to the to the front page, it will start shoveling more data in memory right. So.

D

A

D

I think like I wanted to check that if it's actually are we loading more sources afterwards but, like I think they are like like we need to be aware of like two things like there is like an actual loading of the application.

D

I mean the code in the code base and like the classes, but there is also like the initialization of the application that happens on the first request, like we lazy in it some of the variables some of the configs and things like that and that's gonna happen like on the first request. Basically.

E

Like but you're not talking about initializers right.

A

E

A

E

Talking about initializes right! Well, what? What kind of improvisation I'm talking that we have parts of the codebase uh that.

D

Are blazingly in it like strong common minds for.

D

Yeah, basically, it's basically lazy initializing and I'm just curious. Like do we load like some pieces of the code because of that that are not preloaded in like the regular way and like our upper load, that when you request the part of the classes that are not yet referenced uh and the second one is like how many the strong memorize happens that may like impact our memory structure, that gonna like impact copy and write as well.

D

So I guess this is what happens after the actual preloading, and this is why the this real world fork actually is beneficial, because it ensures that we hit these parts of the code base. Basically, as well.

A

So alexi did this. I don't know why nakayoshi fork is such a like silly name, but.

E

A

What's supposed to mean, but.

B

Like so, this is.

A

Not the same thing as this reforking model.

B

No, no, no they're, two different completely pictures, yeah and they're not compatible. As far as I know, I wasn't able to run post nakayashi fork and fork work and see the results. So maybe maybe I did something wrong, but I didn't I wasn't able to pair them up yeah. So, but like my my outcomes from this, that probably single mod is the way I mean it's much better. It's like more than 100 megabytes better than what I see of any of these experiments and yeah.

B

I don't really think I could do something more without looking at the hip at this moment like because these are just not like fully uh fully proofed experiments, because I just run it. I just observed the grass, but.

A

Is the performance the same like we because I mean we talked about it a little bit yesterday, but I feel like we're comparing apples and oranges there right, because we're saying we're basically saying well a single.

B

Progression deployment.

A

Consumes less memory than a three-node deployment, but yeah? Well, the of course it does because you may have like you know, two less processes. I mean that's, not really surprising right. So the question.

D

A

Does it perform equally.

D

Well, yes, performance will be worse right, but this is kind of like the trade-off that you are like saying. It's like. I guess like it's about like the scaring like how much cpus you expect on the two gig instance and how beneficial is using the cluster mode, which is really like for the multi cpu installs.

D

So so I guess it's really like about the rps that we intend to support.

A

Yeah, that's exactly what I'm trying.

D

To do you might be like one like more efficiently but like slower, but still like, have this treading capacity yeah yeah, I'm I. I would kind of assume that, like on the like these two gig instance, he would kind of expect to have one or two v2 rcpus.

D

Usually so I I I would say that this is mostly likely like the what you are expecting. The case of the raspberry pi's probably is different, because the raspberry pi has four vcpus.

D

I mean four cpus, which are very slow.

B

I don't know like probably like to be honest when I was looking at the gpg. It wasn't like really worse, more test failed, but it wasn't like. uh Let's say it was like two seconds instead of one second like or one point, five five, a failed test. So I found this acceptable on a single network and.

A

I think that weren't the the performance test we ran, they were a little bit unfair right because we were giving it more traffic than that kind of architecture is actually meant to support right.

B

To be honest, I started to run five sec, not like five seconds but two rps. I started to try it.

E

B

It's just faster yeah. It was just much faster for me to see under some load because my point was not to run performance test itself but to warm up, let's say some endpoints and to load some data into memory. So I was mostly running on five seconds to rps just to go through all this.

D

B

Camille, I have like a short question for this, so what's holding us from updating 215 like except the fact that I noticed, we use our fork right, our own, like gitlab,.

D

Nothing is really holding us. I mean it just came out last week right, it's like I'm sure that this works for property and, like we can't figure it out.

B

D

On this aspect, because like this, we are using our fork, because the older version didn't have like this, uh my fix to the performance, but this as an experimental feature, so you can figure that so nothing really is holding us it's more like the time to ensure that it works properly.

D

B

So I I think that uh I will probably uh I don't know pair with nikola matias, to learn more about hip today, because I think that, like even if to continue with puma, I should at this point I should start looking into the hip difference. I mean.

D

So I'm kind of like wondering because, like I, I'm kind of set on the single mode for that very small stuffs, but there is also this refork thing like. I just wonder that maybe asset phrase, this is something like for us to figure out, maybe not this week, but maybe future in the future to actually understand how much benefit it brings us when running in this cluster style.

B

Well, I, what I could do is I could set an aggressive report, let's say every 100 requests and run and run gpt and see how how the memory grows. So, if it will be at bay like much more in bay compared to let's say less aggressive report, then it gives some difference.

A

Yeah, I think the key point really is that the main benefit you will get from this or you get will get more benefit if the zero worker has is representative of what.

B

uh Yeah right, like.

A

Right, so you need to warm it up and.

D

A

Across like a number of yeah, so so I don't know, I don't.

B

Know we need to run quite long.

A

B

Maybe maybe a quite long gpt with more requests and more aggressive reworking like even if it like hits performance just to see how it affects memory. I could run it really fast.

A

It's also it I don't know how the puma authors would see this being used so because I keep thinking like for conferences for us right, we.

B

Have a lot more.

A

Options like with how we roll this out right, so we could, um because you can do like rolling deployments right where you, maybe you deploy a new worker and then you let it warm up in the background. While you have your old workers still servicing traffic and then you might switch over to the new.

B

One from which you then.

A

Fork, but I I don't know how like in environments where we don't control uh how this should run in production or where we have no idea. Basically, what kind of traffic we will service um this might be really hard to tune. I think.

D

I'm kind of like curious like how this reforking, how stable is that and like first like how much benefit it brings and like what like side side behaviors? It has on running on the production because, like I think that I get super clever, but it's also it's like kind of different and more risky compared to how we run before so.

B

Yes, they mention it that it's like much more experimental. Let's say it's a nakayoshi fork because, while nakayoshi foreign gts, this thing could cause some side effects, as I mentioned. So yes,.

D

Yes and like nakayashi folk is pretty much safe, really yeah. I guess because, like it's pretty straightforward, but I'm just kind of wondering if there is like some way for us to have the reform benefits without doing quick fork so like in the current model yeah- and this is like what I'm like- also interested about like.

B

Maybe, on the hook.

D

This is what I'm kind of like trying to like to sit through like this, what is being loaded afterwards working initialize afterwards? What happens between this time? That, like you, fork and you process a single request and like there is this spike in the memory consumption like what operations do happen during that time? Can we somehow make it happen before how much it would bring a benefit? Because if we can somehow like make this work without the reforking, it would actually be a probably much safer way to implement, and maybe we could save like another.

D

I don't know 50 megabytes.

B

So, ideally, you want to master before, firstly forking into worker, to do some job right like to imitate something like either loading additional code, which will happen anyway in first request or serve something right, and then they start working towards. Yes,.

D

So I'm kind of really curious, like what happens and actually like this can be done like with the execution profile, probably what happens between like our fork and like processing the request, pretty much like any request, and actually I'm kind of now, thinking that we could really like do uh like execution profile out of that what is being executed, I mean I mean the frame graph, maybe but the second one, like execution profile, what methods are being hit during that time, like of between like fork when you fork into a new process and like then, you execute like the first.

B

Request. Okay, so maybe something to open here.

B

Let me add this to the top, so I think I I finished with my update, so I just that's what camille mentioned into the dog.

B

I don't know where it.

B

B

Okay, so where's the next camille.

D

Sure so I I guess, like my all items that I look out, um I look at the that 20 rps. It's like pretty nasty for running in the 2g, because one of the reasons that I I kind of like got to understanding like why. We, I see like pretty pretty nice, like the puma and sideki memory usage over time, but like a very steep increase in the swapping at some point, and I noticed that it's due to like to get processes running from the guitar.

D

It's basically spin ups, like 20 30 git processes because of the unlimited concurrency, which kind of leads me to the to the conclusion that in this small instance, we should limit, like the git concurrency as well, to ensure that we control like how much is being spin up and why it is bad, because these git processes are in the gpt. They are for, like the big repo and they consume, like a ton of memories.

D

Memories being started like because, like the git repo is a mmap, but it's still uh pretty like uh evicting to the application, but.

A

I thought is this because of rugged, because I thought we used giddaly, which.

E

Should be single, proc right and just use? Oh.

A

E

Like like, italy has like actually has three ways of accessing.

D

People- okay, am I getting data from the repo like I'm not talking about the rugged rugged is something to kill at the soonest possible moment. I'm talking about the guitar guitar. You can open repo directly in the git pro in the go process using clip gt2 and they implement some of the functions. They have legacy way of accessing git repo, which is from through italy, ruby, a site process that is running that they communicate with this italy.

D

Ruby also opens, but in the ruby context, the repo using clip g2- and there is also for some commands that are hard to implement with libgh we just sell out to the jit command.

D

To be like, for the some cases for some requests in this 20 rps to be like pretty evicting, because we're gonna spin up a ton of git processes, basically and.

A

It sounds like the image scale problem all over again. Remember because we had to put uh kind of a rate limit in place for the image scaler as well, because we just fork on request.

D

Yeah, so so it's actually the same problem I mean literally, I lost like to emit that, but by default it's unlimited. So I'm kind of thinking that if you would limit that to like to some same number, maybe two or four or this two gig, it would kind of, like let's say, behave properly. I.

A

Mean the problem there is with image scaling. We have a clean, clean, failover path, because we can always go back to serve the original image. But what is the failover path with git you? You can't just drop the request on the floor right because you still need to.

E

E

D

For some time, which is like the deadline to find and like if, if you, if you like, go beyond this deadline, it's going to basically fail.

E

So it's a queueing problem then, which comes with its own problems. Yes, but like.

D

But like I guess, like, I think it's acceptable like to have limits how many concurrent operations.

A

D

And like it's acceptable that.

D

It's just like this small install you're gonna have like lower capacity, so maybe it's kind of like after alexi comment, like maybe 20 rps is just like over here, like we should be testing, maybe two or four rps, which is like yeah, even like the like amount.

E

Of the resources that is available for you and how much memory does each get process consumed just like roughly? Do you remember? uh It really depends on the on the repo that.

D

You're opening because they are mapping the window so like this time, it's gonna need some memory. Buffers like to do this and mapping, and also there's some others on our processing, but git is generally very efficient, but, like still a mapping requires like some.

D

Some memory like to run and if I record properly like the very big memory usage by the git process, is something that is in general, our problem, basically, that it kind of creates a spike of the memory usage on circumstances and like we have no way to emit how much memory is being used today and and and- and I think that there are some issues already for that.

D

E

Not in our tracker but.

A

uh Of the guitar yeah, thanks for sharing that I had no idea it worked that way. I I was kind of assuming like that. Oh a nice way of getting you know, everything runs single croc and apparently not it's good to know.

D

So then I look at the dc params, I think, like the defaults, and I kind of try to be very aggressive, like I noticed that, like being very aggressive, is actually beneficial for the memory usage, because it's much smaller.

D

However, I still was unable like to see the three pages like being in the in the guitar in the gitlab process, so I may continue looking at that. I noticed that on my like linings when I was running side, but I was not noticing that on the on the gitlab process, so I'm not sure yet. I need to investigate exactly like if there are, if it's due to the pages that maybe are fragmented, fragmented or maybe it's due to some other reasons. I don't know that about that.

D

Yet uh I also like which is kind of like I think it's very close to like what matthias did.

D

uh I kind of like look at the each object of the string and like look at the frozen ones, unfrozen and one that can be deduplicated, and I noticed that like we can really like to duplicate about 25 megabytes of the strings, basically in in the ruby processes of github, which seems like pretty big number. I just didn't yet found like the easy way to duplicate everything like I can iterate some objects and like that appliques in these objects, but some objects are frozen, so it makes it impossible to duplicate them.

D

I mean I look at the ruby gc code, but it's pretty complex. It's like 12 000 likes of the code.

D

It's not really straightforward, so I didn't found a way like to, let's say, update all the memory, references of the all underlying structures to perform this kind of the application in the in the native, I'm kind of like thinking that in general, like the uh gc, the application of the strings could be really like a nice addition to the ruby because, like it's actually free memory to reclaim without any performance benefit without any impact on the application running at all and like it actually is kind of line irani with the big application, the bigger the application is and like mathias.

D

Like you, you said that, like we are locating how much like of the of the ruby memory, something like.

E

D

Max or something.

E

Of the ruby hip or is it more.

A

It kind of depends on what you look at that's the real main thing that kept confusing us. So if I, if you sum up uh so I looked at two things like that was two days ago or so, which was summing up, uh uh object, space each object and that came down to, I think 190 megabyte for me. And then I look at the gc stats as well.

A

And if you, if you they can sum up uh the number of allocated pages uh which might not be full right so so that doesn't actually so that's kind of an upper bound. So it's an upper bound for the current eu, but it's actually representative of or should be representative of rss. uh So if you, if you multiply that with the page size which is 16k, I also got around that number like 150 to 190 megabytes, which.

B

A

uh Well below um the actual memory used, but if you look at reports from, I think what was it was it derailed? I think nikola you spent some time with derail like was reporting, 900 megabyte used, or so so it must be looking at. I don't know yet what it does, but it must be looking at other stuff as well so and hippie. I think as well was um it was um considering.

A

I don't know like what they we can maybe look at what they do.

C

Yeah rss for hip is showing 800 megabits.

A

Yeah, that sounds that makes total sense, because that's yeah what we were seeing like idle memory for the.

D

Worker, uh so then, I also like um look at this instrumentation. It was like, like um commenting like one line of the code basically- and it kind of gave me like the some idea that just removing that alone should give us probably about half a mac uh of like reduced memory usage like basically for free. I also look at the prometus, client and, and I kind of remembered, exactly how it is behaving and how it's implemented, and basically prometheus client can also quite significantly increase memory pressure over application run, and I'm just I'm sure.

D

Maybe this is also one of the reasons why this usage grows over time, and why, like it, increases the initial memory usage because, like uh different types of the metrics like histogram counters, gorgeous with different aggregation types, each of them need like an individual and a map file, basically, which is owned by the process and the current mmap minimal and map size is like for max.

D

So it means that we may not use the the old pages of that, but it means that over execution time, if we gather a lot of metrics, we're gonna be actually kind of linearly increasing amount of the map memory and we might be kind of requiring a significant space on memory buffers which kind of reduces the amount of the memory available to like to the process itself.

D

I I didn't get exact number like after disabling prometheus, but I think I saw like something around 60 or 80 megabytes less when disabling prometheus metrics. So I guess this is like another venue for us like to really investigate uh our prometheus integration, because I think it can be pretty big. Think uh of the memory usage in the processes.

A

There was an outage even right, a couple months ago, where um we were, we were running out of memory in production and they tracked like there was a super interesting. I think I floated in the channel. Back then, like stan, did a super interesting like debugging session, where they looked at a bra like a core dump, basically of what was going on in one of these processes, and it turned out. There were like massive strings of um prometheus metrics.

A

We were trying to push out through the exporters which consumed a ton of of memory, and they they made a change like they were. I forgot the exact reason why these strings were so large, but apparently there was like a bunch of information in that we didn't actually need to export, so they fixed it that way, but maybe there's still, you know more opportunity to to look at. So maybe that's actually maybe that's exactly the same reason you're, seeing that I'm.

D

I'm I'm not sure about like like, for example, if you have like very long like well, I don't think that we really sanitize the label length, so it's very likely that you may have very long clay bars, which is one thing, but the second thing is like the way how it's written is just it's kind of like linear growth over time and it's like not really optimized.

D

So I guess this is something really like for us to look at and figure out like how how it can be done better, more efficient or or anything like, because, like I, I think, like we are kind of underlining of the scaling of the coming prometheus matrix like we can like even scraping of these prometheus matrix takes. I don't know on the github.com something around five seconds. Basically, so it's pretty it's pretty extensive. It's pretty.

D

I mean it's, it's very, very optimal in the way how it's written, but I think, due to the amount of the matrix that we have, we need to figure out something basic, basically better different approach to this problem, to ensure that we don't hold that in the process.

D

Memory, uh we maybe somehow accumulate that matrix somewhere else, I'm not sure like how to approach that, but I think that this can be really a big thing of the memorial usage that increases over time and that, like we basically store like multiple matches per single controller as well. So um I I think this could be also one of the reasons why we see, like pretty, let's say reasonable usage of the memory on the ruby process. But there is a lot of other memory leaking outside. I mean nothing.

D

It's kind of like being mallocked or like done outside of the of the ruby and prometheus client today is written natively in the sea. So it's a mapping, it's pretty efficient, but like it's not you cannot recycle that memory. That's the problem as well, uh and it actually has pretty severe consequences for for, like the long-running processes where, like it, increases memory like kind of like into infinite.

C

A

Yeah, I'm cool thanks, yeah, let's uh I guess I can do it together with nikola, because we work together most of the day um yeah. So what we we wanted to drill more into the heap usage of the puma process in general like what's going on, and um so what we spent most of the time on was um pulling looking at a heap dump generated with uh object.

A

Space dump all so this actually gives you a full account that even includes like memos- and you know, internal ruby objects like vm objects, so it's fairly complete and also, if you look at the cumulative size of these reports, they are much much closer to what we were seeing in terms of our rss, not.

B

Always right there.

A

Were some discrepancies because it also depends it's a bit tricky because you need to trace allocations first and depending on when you start tracing allocations, you might not actually catch. You know all the requires or whatever uh you basically need to do it like before your app does anything. uh So so it's not it wasn't super easy uh to yeah gotta get a full account, um but anyway, so uh we used so by default. This is just a json dump like a json blob.

A

Basically uh it's fairly readable, but it's like not aggregated, so there's a tool called hippie um that we used to run over that data, and it gives you like a interesting breakdown of like uh yeah like where memory is allocated. It can trace it back to files uh as well. So that's all updated in the issue, um and uh we then so I I so it yeah it took a while to get there and to actually get this.

A

I think today we need to spend a bit more time and actually like looking at this data and looking at all these individual reports, because there's a ton like nicola.

B

Updated it all.

A

In the issue, there's like a dozen different sections that we can, we can try to wrap our heads around uh and uh we then proceeded to.

A

um So one thing we observed was uh that there is a big chunk of memory uh that is reported as just anonymous pages, so it's just like in ram uh pages that are not backed by files right so and it we don't really know what's in there, and uh one thing we wanted to find out was do like to what extend to um like things like shared libraries and like yeah like external code. um That is not ruby, uh contribute to this.

A

uh So one thing we did was we looked at um the memory map map, mapping data for uh running uh puma worker, and that is summarized in a spreadsheet that actually, maybe I can quickly show that um just super quickly. I.

E

Mean that's not, um unfortunately, that is like.

A

Not um a big takeaway, but maybe it's interesting to see.

E

So, like materials, you are saying that, like we have like a little of the memory marked files or something like that,.

A

Yeah, the a very tiny amount only so so this is the raw data from from by the way we ended up running, we ended up changing the uh uh so this came from the vmware omnibus, but we changed it to single mode, because it was quite difficult to reason about, like page sharing all that, if you have three different processes to look at so we wanted to see first uh yeah just in terms of uh how the memory, what the memory layout is, we wanted to look at a single process, so this is not from a single node, uh a single worker uh puma, um and you can see this- is I just created a pivot table based by, or maybe I should say first so this mapping will tell you so blank basically means it's not mapped by a it's not backed by a file.

A

So it's not it's just um yeah directly map memory. That is just in in memory for that in ram for that process, and um this is the aggregation by the um mapping- it's not always a file, uh so so this blank here is basically the sum of all the memory that is not file backed. uh They get all ended up kind of up here, that's just the total, uh but yeah. You can see that the vast majority like is um not backed by by a file, but I think.

B

It was still an interesting.

A

Exercise to see uh what kind of contributes to memory, but all of this is like quite tiny, so it probably doesn't make sense to to look at external libraries but yeah. You.

D

A

Have a look at it, uh it might be interesting to see just like what uh there's.

B

A

Stuff actually that when we um uh map into uh memory that is not specifically ruby code, so I thought it was interesting to look at um what else um yeah I mean that's pretty oh yeah. We also, then, were wondering okay. um If, uh if there's some discrepancies between like what object space tells us is being allocated on the heap, maybe there's a ton of like melodics going on that are not reflected in these reports. So we wanted to see um maybe there's a way to understand what calls malloc or from where do.

A

We call malloc and um how much memory is allocated that way.

A

We have no idea if it's going to lead anywhere, but so we looked at je mellock again, which is the allocator that we use and it has a profiling companion tool called je prof, uh so you can actually and we compile j email up with profiling, support enabled, which is good. um That's just configured like that in omnibus already, so uh we created a profile dump. uh You can have a look at that in the issue. uh It's just like totally illegible by default because it does not.

A

um It does not uh like the the symbols like where this is being called from. It's just, I think, an address in memory, so it's unusable. But apparently, if we didn't get around to do that yesterday, there's a.

B

A

Unwind these and map them back to the actual symbol that was used. Apparently that's preserved some or help that's preserved somewhere. So I'm hoping that if we try that again with this that there's like a um uh igor fontes at the end of the day, there's like a parameter, you can run it with dasher symbols which should um reverse the um resolve. These address the raw addresses to a simple table so that we can actually see what the method I guess or the function was that we were calling that so maybe.

C

I don't know I tried running it this morning, but it's stuck. It's stuck yeah.

A

Yeah, so the problem is these heat dumps are massive right, it's just like it's a sampling profiler and you can tell it the interval it will like sample. You know every zone, so many like kilobytes or megabytes or whatever you tell it, but uh it's quite large. You know it produces large files and a lot I think the first heat dump we took was like, like 20 files or something yeah, uh so that that was that was yesterday. um So uh well, okay, we can talk about two today later.

A

Well, I guess that wasn't right, so we can talk about today. Should I just start, then I then someone else can take over.

D

Yeah, so what I was thinking.

A

Was is I I still like would like to understand like where all these big differences come from. I think it would help us maybe to understand uh yeah in what sense, hippie and derail see seem to give us a more complete account of the memory uh spent. So so so what is that like? What is that difference between this and what we were seeing when we were looking at gc stats and object space results generally uh separately? I think that that would work. uh You said you tried this already.

A

Do you think it's not worth spending more time on nicola.

C

No, I I still think that we need to figure it out like and maybe detect the potential offenders so.

A

C

Yeah I mean, maybe you can time box that maybe.

A

C

I still suspect it will be.

A

Difficult to work with the result of this, but I mean who knows um yeah then I mentioned already. I think so. We have a lot of data collected, but uh I don't know about unicolor, but I haven't spent a lot of time yet on actually looking at all this stuff, because there's a lot now, we have all these uh derailed breakdowns and I think there's like the hippie breakdown somewhere as well, which could be interesting.

A

So I think it would be good to start maybe summarizing somewhere, because we have a lot of these like raw data dumps in this issue, uh and it it takes more and more time to scan through all this stuff and it's a bit overwhelming.

B

I think yeah, I should start kind of wrapping this up a bit and.

A

Try to draw conclusions and highlight the interesting bits, maybe in a separate place even like that I don't know um so I think that's some something we should spend some time on uh yeah, that's all I can think of for now.

C

Yeah, I agree with you, like I don't know with the gmat uh dumps. I will try just to fix the library and see why it's stuck. So I will really time box it to see if we get some readable data, but I will really time box it. But in that issue, as you said, we collected like a lot of data from different tools like uh so we should start like trying to get some sense from those data and see.

C

Can we do something about it or because I think we collected a lot of things like there are a lot of lot of lines that that could be expanded in that issue and we didn't spend.

B

C

Time off like trying to understand those results, so it's something that I will try today.

A

Yeah, I mean it's good that we have a ton of data.

C

So we can probably split somehow later yeah.

A

Okay, that means.

A

All right, I'm gonna, stop sharing my screen.

A

uh Who's next just about well, I think that.

C

I don't have much to add to the matthias update.

A

Okay, so no, I was thinking because we did we go through the today's stuff already.

B

um Yeah but I have a question so I wonder: should we use like canils settings already in running some benchmarks and so on.

D

We are super aggressive, really, sorry, you don't think it works. I I I I think like it shows one aspect like the things are not the best and like we need to find settings that are tuned to github, but my settings are not tuned to to github. They actually okay performance issues. It's rather like.

D

I think this is like the point for the future like we can gain a lot if we tune the dc settings. Basically, I just don't know how all the effects are there got it.

D

So, like the only, I guess thing that we may use is like puma singular, which is like I, I think, like predictable.

D

Maybe, like uh I don't know, disabling prometheus prometheu matrix instrumentation is also like kind of predictable, like we lose some observability but, like uh I mean like it kind of reduces memory research to kind of, like I mean like if you disable that materials and nikola like you, should see slightly different pattern, but then you kind of know that this is due to the prometheus, basically disabled. So uh this this can like remove remove some noise from your testing as well.

B

I also found an article on the duplication, so I'm curious by some saffron, but he mentioned some limitation like, for example, if we uh get like a method on a string, it creates a bar in ruby and it won't allow us to duplicate. Do we have the same limitation like in what you camel mentioned, and also he mentioned that, uh for example, painted strings which I didn't like have experience with, so he mentioned that there are a lot of duplicated strings which are painted.

B

So is that kind of in the same area you're you were talking about? Is it different.

D

Yeah so actually like. If you look at the issue that I I created and I posted some results, I there are actually like plenty of less strings. That would be the dark and like like the strings that have eye bar assigned. There is like just a few of them, there's a bunch of unfrozen strings and we could probably freeze these strings like by by looking at the libraries that don't freeze them and like contributing which could result in another, probably around 10 megabytes of the reduction of the number of strings.

D

So actually, uh I think like if we would approach that, like from complete point of view like we could be looking at 30 megabytes, 35 megabytes in total per process.

D

Basically I mean for puma and for side king, basically so, like probably around 60, to 70 megabytes of the reduction of just the string state application.

C

And what are those unfrozen objects to save megabytes? I didn't look at the script.

D

Yeah, so um frozen objects are the objects that are not frozen. I mean frozen no.

C

I I know that, but there.

D

C

Some and to save megabytes.

D

I'm kind of assuming that parts of the libraries that we use don't use the frost stream.

D

This is the reason I'm not 100, confident of that, but I'm assuming that a bunch of these strings can be frozen. Maybe half of them- I don't know how much of them but like there is like 22 megabytes that can be removed of the existing frozen strings, basically like just doing an extension to ruby gc. Basically, nothing.

D

Else, okay, it seems like I don't know, probably like. If you contribute to the ruby ruby this, it will be a piece of code, probably something between 100 to 500 lines of the c code, doing some magic and that would that application would reduce about 50 megabytes without any application changes. Basically, just ruby run time change.

D

I mean, as for me, like I kind of like continue like with these investigations, like uh I I'm not sure like where it's gonna like lead me, but I'm just still curious about. Maybe I finally look at this loading of the of the application, and I understand what is happening.

D

C

B

Okay, should we stop the recording yeah.